Thread: Need help with phys backed shm segments (Postgresql+FreeBSD).
On FreeBSD 4.1.1 and above there's a sysctl tunable called
kern.ipc.shm_use_phys. When set to 1, it's supposed to make the
kernel's handling of shared memory much more efficient, at the
expense of making the shm segment unpageable.

I tried to use this option with PostgreSQL 7.0.3 and FreeBSD 4.2,
but for some reason the spinlocks keep getting mucked up (there's a
log at the tail end of this message).

Anyone using PostgreSQL on FreeBSD probably wants this to work;
otherwise, using extremely large chunks of shm with many backends
active can exhaust kernel memory.

I was wondering if any of the more experienced developers could
take a look at what's happening here.

Here's the log. The number in parens is the address of the lock;
on tas() the value printed to the right is the value in _ret,
for the others, it's the value before the lock count is set.

S_INIT_LOCK: (0x30048008) -> 0
S_UNLOCK: (0x30048008) -> 0
S_INIT_LOCK: (0x3004800c) -> 0
S_UNLOCK: (0x3004800c) -> 0
S_INIT_LOCK: (0x30048010) -> 0
S_UNLOCK: (0x30048010) -> 0
S_INIT_LOCK: (0x30048011) -> 0
S_UNLOCK: (0x30048011) -> 0
S_INIT_LOCK: (0x30048012) -> 0
S_UNLOCK: (0x30048012) -> 0
S_INIT_LOCK: (0x30048018) -> 0
S_UNLOCK: (0x30048018) -> 0
S_INIT_LOCK: (0x3004801c) -> 0
S_UNLOCK: (0x3004801c) -> 0
S_INIT_LOCK: (0x3004801d) -> 1
S_UNLOCK: (0x3004801d) -> 1
S_INIT_LOCK: (0x3004801e) -> 0
S_UNLOCK: (0x3004801e) -> 0
S_INIT_LOCK: (0x30048024) -> 127
S_UNLOCK: (0x30048024) -> 127
S_INIT_LOCK: (0x30048028) -> 255
S_UNLOCK: (0x30048028) -> 255
S_INIT_LOCK: (0x30048029) -> 0
S_UNLOCK: (0x30048029) -> 0
S_INIT_LOCK: (0x3004802a) -> 0
S_UNLOCK: (0x3004802a) -> 0
S_INIT_LOCK: (0x30048030) -> 1
S_UNLOCK: (0x30048030) -> 1
S_INIT_LOCK: (0x30048034) -> 0
S_UNLOCK: (0x30048034) -> 0
S_INIT_LOCK: (0x30048035) -> 0
S_UNLOCK: (0x30048035) -> 0
S_INIT_LOCK: (0x30048036) -> 0
S_UNLOCK: (0x30048036) -> 0
S_INIT_LOCK: (0x3004803c) -> 50
S_UNLOCK: (0x3004803c) -> 50
S_INIT_LOCK: (0x30048040) -> 10
S_UNLOCK: (0x30048040) -> 10
S_INIT_LOCK: (0x30048041) -> 0
S_UNLOCK: (0x30048041) -> 0
S_INIT_LOCK: (0x30048042) -> 0
S_UNLOCK: (0x30048042) -> 0
S_INIT_LOCK: (0x30048048) -> 1
S_UNLOCK: (0x30048048) -> 1
S_INIT_LOCK: (0x3004804c) -> 80
S_UNLOCK: (0x3004804c) -> 80
S_INIT_LOCK: (0x3004804d) -> 1
S_UNLOCK: (0x3004804d) -> 1
S_INIT_LOCK: (0x3004804e) -> 0
S_UNLOCK: (0x3004804e) -> 0
S_INIT_LOCK: (0x30048054) -> 0
S_UNLOCK: (0x30048054) -> 0
S_INIT_LOCK: (0x30048058) -> 1
S_UNLOCK: (0x30048058) -> 1
S_INIT_LOCK: (0x30048059) -> 1
S_UNLOCK: (0x30048059) -> 1
S_INIT_LOCK: (0x3004805a) -> 0
S_UNLOCK: (0x3004805a) -> 0
S_INIT_LOCK: (0x30048060) -> 0
S_UNLOCK: (0x30048060) -> 0
S_INIT_LOCK: (0x30048064) -> 0
S_UNLOCK: (0x30048064) -> 0
S_INIT_LOCK: (0x30048065) -> 0
S_UNLOCK: (0x30048065) -> 0
S_INIT_LOCK: (0x30048066) -> 0
S_UNLOCK: (0x30048066) -> 0
S_INIT_LOCK: (0x3004806c) -> 0
S_UNLOCK: (0x3004806c) -> 0
S_INIT_LOCK: (0x30048070) -> 0
S_UNLOCK: (0x30048070) -> 0
S_INIT_LOCK: (0x30048071) -> 0
S_UNLOCK: (0x30048071) -> 0
S_INIT_LOCK: (0x30048072) -> 0
S_UNLOCK: (0x30048072) -> 0
S_INIT_LOCK: (0x30048078) -> 0
S_UNLOCK: (0x30048078) -> 0
S_INIT_LOCK: (0x3004807c) -> 0
S_UNLOCK: (0x3004807c) -> 0
S_INIT_LOCK: (0x3004807d) -> 0
S_UNLOCK: (0x3004807d) -> 0
S_INIT_LOCK: (0x3004807e) -> 0
S_UNLOCK: (0x3004807e) -> 0
tas (0x30048054) -> 0
tas (0x30048059) -> 0
tas (0x30048058) -> 0
S_UNLOCK: (0x30048054) -> 1
tas (0x30048048) -> 0
tas (0x3004804d) -> 0
tas (0x3004804c) -> 0
S_UNLOCK: (0x30048048) -> 1
tas (0x30048048) -> 0
S_UNLOCK: (0x3004804c) -> 1
S_UNLOCK: (0x3004804d) -> 1
S_UNLOCK: (0x30048048) -> 1
tas (0x30048048) -> 0
tas (0x3004804d) -> 0
tas (0x3004804c) -> 0
S_UNLOCK: (0x30048048) -> 1
tas (0x30048048) -> 0
S_UNLOCK: (0x3004804c) -> 1
S_UNLOCK: (0x3004804d) -> 1
S_UNLOCK: (0x30048048) -> 1
tas (0x30048048) -> 0
tas (0x3004804d) -> 4
tas (0x3004804d) -> 1
tas (0x3004804d) -> 1
tas (0x3004804d) -> 1
tas (0x3004804d) -> 1
tas (0x3004804d) -> 1
tas (0x3004804d) -> 1
tas (0x3004804d) -> 1

repeats (it's stuck)

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
Alfred Perlstein <bright@wintelcom.net> writes:
> Here's the log, the number in parens is the address of the lock,
> on tas() the value printed to the right is the value in _ret,
> for the others, it's the value before the lock count is set.

This looks to be the trace of a SpinAcquire()
(see src/backend/storage/ipc/spin.c):

> tas (0x30048048) -> 0
> tas (0x3004804d) -> 0
> tas (0x3004804c) -> 0
> S_UNLOCK: (0x30048048) -> 1

followed by SpinRelease():

> tas (0x30048048) -> 0
> S_UNLOCK: (0x3004804c) -> 1
> S_UNLOCK: (0x3004804d) -> 1
> S_UNLOCK: (0x30048048) -> 1

followed by a failed attempt to reacquire the same SLock:

> tas (0x30048048) -> 0
> tas (0x3004804d) -> 4
> tas (0x3004804d) -> 1
> tas (0x3004804d) -> 1
> tas (0x3004804d) -> 1
> tas (0x3004804d) -> 1

And that looks completely broken :-( ... something's clobbered the
exlock field of the SLock struct, apparently. Are you sure this
kernel feature you're trying to use actually works?

BTW, if you're wondering why an SLock needs to contain *three*
hardware spinlocks, the answer is that it doesn't. This code has
been greatly simplified in current sources...

regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [001205 07:43] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > Here's the log, the number in parens is the address of the lock,
> > on tas() the value printed to the right is the value in _ret,
> > for the others, it's the value before the lock count is set.
>
> This looks to be the trace of a SpinAcquire()
> (see src/backend/storage/ipc/spin.c):

Yes, those are my debug printfs :).

> > tas (0x30048048) -> 0
> > tas (0x3004804d) -> 0
> > tas (0x3004804c) -> 0
> > S_UNLOCK: (0x30048048) -> 1
>
> followed by SpinRelease():
>
> > tas (0x30048048) -> 0
> > S_UNLOCK: (0x3004804c) -> 1
> > S_UNLOCK: (0x3004804d) -> 1
> > S_UNLOCK: (0x30048048) -> 1
>
> followed by a failed attempt to reacquire the same SLock:
>
> > tas (0x30048048) -> 0
> > tas (0x3004804d) -> 4
> > tas (0x3004804d) -> 1
> > tas (0x3004804d) -> 1
> > tas (0x3004804d) -> 1
> > tas (0x3004804d) -> 1
>
> And that looks completely broken :-( ... something's clobbered the
> exlock field of the SLock struct, apparently. Are you sure this
> kernel feature you're trying to use actually works?

No I'm not sure actually. :) I'll look into it further, but I
was wondering if there was something I could do to debug the
locks better. I think I'll add some S_MAGIC or something in
the struct to see if the whole thing is getting clobbered or
what... If you have any suggestions let me know.

> BTW, if you're wondering why an SLock needs to contain *three*
> hardware spinlocks, the answer is that it doesn't. This code has
> been greatly simplified in current sources...

It did look a bit strange...

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
BTW, I just remembered that in 7.0.*, the SLocks that are managed by
SpinAcquire() all live in their own little shm segment. On a machine
where slock_t is char, it'd likely only amount to 128 bytes or so.
Maybe you are seeing some bug in FreeBSD's handling of tiny shm
segments?

regards, tom lane
Alfred Perlstein <bright@wintelcom.net> writes:
> No I'm not sure actually. :) I'll look into it further, but I
> was wondering if there was something I could do to debug the
> locks better. I think I'll add some S_MAGIC or something in
> the struct to see if the whole thing is getting clobbered or
> what... If you have any suggestions let me know.

Seems like a plan. In current sources I have moved the SLock struct
declaration out of header files and into spin.c; it doesn't really
need to be known anywhere else. You could probably do the same in
7.0.*, which would greatly simplify changing the struct around to
see what's happening.

regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [001205 08:37] wrote:
> BTW, I just remembered that in 7.0.*, the SLocks that are managed by
> SpinAcquire() all live in their own little shm segment. On a machine
> where slock_t is char, it'd likely only amount to 128 bytes or so.
> Maybe you are seeing some bug in FreeBSD's handling of tiny shm
> segments?

Good call, i think I found it! :)

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
* Alfred Perlstein <bright@wintelcom.net> [001205 12:30] wrote:
> * Tom Lane <tgl@sss.pgh.pa.us> [001205 08:37] wrote:
> > BTW, I just remembered that in 7.0.*, the SLocks that are managed by
> > SpinAcquire() all live in their own little shm segment. On a machine
> > where slock_t is char, it'd likely only amount to 128 bytes or so.
> > Maybe you are seeing some bug in FreeBSD's handling of tiny shm
> > segments?
>
> Good call, i think I found it! :)

Here's the patch I'm using on FreeBSD, it seems to work, if any
other FreeBSD'ers want to try it out, just apply the patch:
  cd /usr/src/sys/vm ; patch < patchfile

and recompile and boot with a new kernel, then do this:

  sysctl -w kern.ipc.shm_use_phys=1

or add:
  kern.ipc.shm_use_phys=1
to /etc/sysctl.conf

Let me know if it works.

thanks,
-Alfred

Index: phys_pager.c
===================================================================
RCS file: /home/ncvs/src/sys/vm/phys_pager.c,v
retrieving revision 1.3.2.1
diff -u -u -r1.3.2.1 phys_pager.c
--- phys_pager.c	2000/08/04 22:31:11	1.3.2.1
+++ phys_pager.c	2000/12/05 20:13:25
@@ -83,7 +83,7 @@
 		 * Allocate object and associate it with the pager.
 		 */
 		object = vm_object_allocate(OBJT_PHYS,
-			OFF_TO_IDX(foff + size));
+			OFF_TO_IDX(foff + PAGE_MASK + size));
 		object->handle = handle;
 		TAILQ_INSERT_TAIL(&phys_pager_object_list, object,
 		    pager_object_list);
Alfred,

do you have any numbers with and without your patch?
I mean performance. You may use pg_check utility.

Oleg

On Tue, 5 Dec 2000, Alfred Perlstein wrote:

> Date: Tue, 5 Dec 2000 13:04:45 -0800
> From: Alfred Perlstein <bright@wintelcom.net>
> To: Tom Lane <tgl@sss.pgh.pa.us>
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Need help with phys backed shm segments (Postgresql+FreeBSD).
>
> [quoted patch message snipped]

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Just as interesting

On Tue, 5 Dec 2000, Alfred Perlstein wrote:

> [quoted patch message snipped]

Randy Jonasz
Software Engineer
Click2net Inc.
Web: http://www.click2net.com
Phone: (905) 271-3550

"You cannot possibly pay a philosopher what he's worth,
but try your best" -- Aristotle
* Oleg Bartunov <oleg@sai.msu.su> [001205 13:33] wrote:
> Alfred,
>
> do you have any numbers with and without your patch ?
> I mean performance. You may use pg_check utility.

Er, I just made the patch a couple of hours ago, and I'm also
dealing with some other FreeBSD issues right now. I will report
on it as soon as I can.

Theoretically you'll only see performance gains when doing fork().
The real intent here is to allow for giant segments: without
kern.ipc.shm_use_phys=1, running, let's say, 768 MB (out of 1 GB)
shared memory segments will probably cause performance problems
because of the amount of swap structures needed per process to
manage swappable segments.

I'm going to be enabling this on one of our boxes to see if it
makes a noticeable difference. I'll let you guys know.

> > Date: Tue, 5 Dec 2000 13:04:45 -0800
> > From: Alfred Perlstein <bright@wintelcom.net>
> > To: Tom Lane <tgl@sss.pgh.pa.us>
> > Cc: pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] Need help with phys backed shm segments (Postgresql+FreeBSD).
> >
> > Here's the patch I'm using on FreeBSD, it seems to work, if any
> > other FreeBSD'ers want to try it out, just apply the patch:
> > cd /usr/src/sys/vm ; patch < patchfile
> >
> > and recompile and boot with a new kernel, then do this:
> >
> > sysctl -w kern.ipc.shm_use_phys=1
> >
> > or add:
> > kern.ipc.shm_use_phys=1
> > to /etc/sysctl.conf
> >
> > Let me know if it works.
> >
> > thanks,
> > -Alfred