Thread: Wierd context-switching issue on Xeon
Folks,

We're seeing some odd issues with hyperthreading-capable Xeons, whether or not hyperthreading is enabled. Basically, when a small number of really heavy-duty queries hit the system and push all of the CPUs to more than 70% used (about 1/2 user & 1/2 kernel), the system goes to 100,000+ context switches per second and performance degrades.

I know that there are other Xeon users on this list ... has anyone else seen anything like that? The machines are Dells running Red Hat 7.3.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
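For readers following along: the "cs" figures quoted throughout this thread come from vmstat, which on Linux derives them from the cumulative "ctxt" counter in /proc/stat. The sketch below is illustrative only (it is not part of the original thread) and simply samples that counter once a second:

    /* cswatch.c -- a rough equivalent of watching vmstat's "cs" column.
     * Illustrative sketch only; Linux-specific (reads /proc/stat). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static unsigned long long read_ctxt(void)
    {
        char line[256];
        unsigned long long ctxt = 0;
        FILE *f = fopen("/proc/stat", "r");

        if (f == NULL)
            return 0;
        while (fgets(line, sizeof(line), f) != NULL)
        {
            /* the kernel exposes total context switches as "ctxt <n>" */
            if (strncmp(line, "ctxt ", 5) == 0)
            {
                sscanf(line + 5, "%llu", &ctxt);
                break;
            }
        }
        fclose(f);
        return ctxt;
    }

    int main(void)
    {
        unsigned long long prev = read_ctxt(), cur;

        for (;;)
        {
            sleep(1);
            cur = read_ctxt();
            printf("%llu context switches/sec\n", cur - prev);
            prev = cur;
        }
        return 0;
    }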
Josh Berkus <josh@agliodbs.com> writes:
> We're seeing some odd issues with hyperthreading-capable Xeons, whether or not hyperthreading is enabled. Basically, when a small number of really heavy-duty queries hit the system and push all of the CPUs to more than 70% used (about 1/2 user & 1/2 kernel), the system goes to 100,000+ context switches per second and performance degrades.

Strictly a WAG ... but what this sounds like to me is disastrously bad behavior of the spinlock code under heavy contention. We thought we'd fixed the spinlock code for SMP machines awhile ago, but maybe hyperthreading opens some new vistas for misbehavior ...

> I know that there are other Xeon users on this list ... has anyone else seen anything like that? The machines are Dells running Red Hat 7.3.

What Postgres version? Is it easy for you to try 7.4? If we were really lucky, the random-backoff algorithm added late in 7.4 development would cure this.

If you can't try 7.4, or want to gather more data first, it would be good to try to confirm or disprove the theory that the context switches are coming from spinlock delays. If they are, they'd be coming from the select() calls in s_lock() in s_lock.c. Can you strace or something to see what kernel calls the context switches occur on?

Another line of thought is that RH 7.3 is a long ways back, and it wasn't so very long ago that Linux still had lots of SMP bugs. Maybe what you really need is a kernel update?

			regards, tom lane
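To make the select()-in-s_lock() point concrete: the acquire path Tom is describing looks roughly like the sketch below. This is not the actual s_lock.c code; the atomic primitive, spin count and delay constants are simplified, and a GCC builtin stands in for the per-platform TAS assembly. The key point is that once a backend gives up spinning it sleeps in select(), so every collision costs kernel entries and context switches; the 7.4 random backoff stretches the sleep so waiters stop waking in lockstep.

    /* Illustrative spin-then-sleep lock acquire; not PostgreSQL source. */
    #include <stdlib.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef int slock_t;

    int
    tas(slock_t *lock)
    {
        /* atomically set *lock to 1 and return the previous value */
        return __sync_lock_test_and_set(lock, 1);
    }

    void
    s_unlock(slock_t *lock)
    {
        __sync_lock_release(lock);
    }

    void
    s_lock(slock_t *lock)
    {
        int     spins = 0;
        long    delay_usec = 10000;         /* 10 ms */

        while (tas(lock))
        {
            if (++spins > 100)              /* stop burning CPU; go to sleep */
            {
                struct timeval tv;

                tv.tv_sec = 0;
                tv.tv_usec = delay_usec;
                /* select() with no fds is just a timed sleep; each call is a
                 * kernel entry and, under contention, a context switch */
                select(0, NULL, NULL, NULL, &tv);

                /* 7.4-style random backoff: grow the delay so that waiters
                 * stop waking up in lockstep */
                delay_usec += random() % delay_usec;
                spins = 0;
            }
        }
    }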
Tom,

> Strictly a WAG ... but what this sounds like to me is disastrously bad behavior of the spinlock code under heavy contention. We thought we'd fixed the spinlock code for SMP machines awhile ago, but maybe hyperthreading opens some new vistas for misbehavior ...

Yeah, I thought of that based on the discussion on -Hackers. But we tried turning off hyperthreading, with no change in behavior.

> If you can't try 7.4, or want to gather more data first, it would be good to try to confirm or disprove the theory that the context switches are coming from spinlock delays. If they are, they'd be coming from the select() calls in s_lock() in s_lock.c. Can you strace or something to see what kernel calls the context switches occur on?

Might be worth it ... will suggest that. Will also try 7.4.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Tom, Josh,

I think we have the problem resolved after I found the following note from Tom:

> A large number of semops may mean that you have excessive contention on some lockable resource, but I don't have enough info to guess what resource.

This was the key to look at: we were missing all indices on a table which is used heavily and does lots of locking. After recreating the missing indices the production system performed normally. No more excessive semop() calls, load way below 1.0, CS over 20,000 very rare; mostly in the thousands realm and less.

This is quite a relief, but I am sorry that the problem was so stupid and you wasted some time, although Tom said he had also seen excessive semop() calls on another dual Xeon system.

Hyperthreading was turned off so far but will be turned on again in the next days. I don't expect any problems then.

I'm not sure if this semop() problem is still an issue, but the database behaves a bit out of bounds in this situation, i.e. consuming 95% of system resources with semop() calls while tables are locked very often and for longer.

Thanks for your help,

Dirk

At last here is the current vmstat 1 excerpt where the problem has been resolved:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 1  0   2308 232508 201924 6976532    0    0   136   464  628   812  5  1 94  0
 0  0   2308 232500 201928 6976628    0    0    96   296  495   484  4  0 95  0
 0  1   2308 232492 201928 6976628    0    0     0   176  347   278  1  0 99  0
 0  0   2308 233484 201928 6976596    0    0    40   580  443   351  8  2 90  0
 1  0   2308 233484 201928 6976696    0    0    76   692  792   651  9  2 88  0
 0  0   2308 233484 201928 6976696    0    0     0    20  132    34  0  0 100 0
 0  0   2308 233484 201928 6976696    0    0     0    76  177    90  0  0 100 0
 0  1   2308 233484 201928 6976696    0    0     0   216  321   250  4  0 96  0
 0  0   2308 233484 201928 6976696    0    0     0   116  417   240  8  0 92  0
 0  0   2308 233484 201928 6976784    0    0    48   600  403   270  8  0 92  0
 0  0   2308 233464 201928 6976860    0    0    76   452 1064  2611 14  1 84  0
 0  0   2308 233460 201932 6976900    0    0    32   256  587   587 12  1 87  0
 0  0   2308 233460 201932 6976932    0    0    32   188  379   287  5  0 94  0
 0  0   2308 233460 201932 6976932    0    0     0     0  103     8  0  0 100 0
 0  0   2308 233460 201932 6976932    0    0     0     0  102    14  0  0 100 0
 0  1   2308 233444 201948 6976932    0    0     0   348  300   180  1  0 99  0
 1  0   2308 233424 201948 6976948    0    0    16   380  739   906  4  2 93  0
 0  0   2308 233424 201948 6977032    0    0    68   260  724   987  7  0 92  0
 0  0   2308 231924 201948 6977128    0    0    96   344 1130   753 11  1 88  0
 1  0   2308 231924 201948 6977248    0    0   112   324  687   628  3  0 97  0
 0  0   2308 231924 201948 6977248    0    0     0   192  575   430  5  0 95  0
 1  0   2308 231924 201948 6977248    0    0     0   264  208   124  0  0 100 0
 0  0   2308 231924 201948 6977264    0    0    16   272  380   230  3  2 95  0
 0  0   2308 231924 201948 6977264    0    0     0     0  104     8  0  0 100 0
 0  0   2308 231924 201948 6977264    0    0     0    48  258    92  1  0 99  0
 0  0   2308 231816 201948 6977484    0    0   212   268  456   384  2  0 98  0
 0  0   2308 231816 201948 6977484    0    0     0    88  453   770  0  0 99  0
 0  0   2308 231452 201948 6977680    0    0   196   476  615   676  5  0 94  0
 0  0   2308 231452 201948 6977680    0    0     0   228  431   400  2  0 98  0
 0  0   2308 231452 201948 6977680    0    0     0     0  237    58  3  0 97  0
 0  0   2308 231448 201952 6977680    0    0     0     0  365    84  2  0 97  0
 0  0   2308 231448 201952 6977680    0    0     0    40  246   108  1  0 99  0
 0  0   2308 231448 201952 6977776    0    0    96   352  606  1026  4  2 94  0
 0  0   2308 231448 201952 6977776    0    0     0   240  295   266  5  0 95  0
Dirk Lutzebäck <lutzeb@aeccom.com> writes:
> This was the key to look at: we were missing all indices on a table which is used heavily and does lots of locking. After recreating the missing indices the production system performed normally. No more excessive semop() calls, load way below 1.0, CS over 20,000 very rare; mostly in the thousands realm and less.

Hmm ... that's darn interesting. AFAICT the test case I am looking at for Josh's client has no such SQL-level problem ... but I will go back and double check ...

			regards, tom lane
Dirk,

> I'm not sure if this semop() problem is still an issue, but the database behaves a bit out of bounds in this situation, i.e. consuming 95% of system resources with semop() calls while tables are locked very often and for longer.

It would be helpful to us if you could test this with the indexes disabled on the non-Bigmem system. I'd like to eliminate Bigmem as a factor, if possible.

-- 
-Josh Berkus

______AGLIO DATABASE SOLUTIONS___________________________
   Josh Berkus
   Enterprise vertical business        josh@agliodbs.com
   and data analysis solutions         (415) 752-2387
   and database optimization           fax 651-9224
   utilizing Open Source technology    San Francisco
After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres.

I did a bunch of oprofile measurements on a machine belonging to one of Josh's clients, using a test case that involved heavy concurrent access to a relatively small amount of data (little enough to fit into Postgres shared buffers, so that no I/O or kernel calls were really needed once the test got going). I found that by nearly any measure --- elapsed time, bus transactions, or machine-clear events --- the spinlock acquisitions associated with grabbing and releasing the BufMgrLock took an unreasonable fraction of the time. I saw about 15% of elapsed time, 40% of bus transactions, and nearly 100% of pipeline-clear cycles going into what is essentially two instructions out of the entire backend. (Pipeline clears occur when the cache coherency logic detects a memory write ordering problem.)

I am not completely clear on why this machine-level bottleneck manifests as a lot of context swaps at the OS level. I think what is happening is that because SpinLockAcquire is so slow, a process is much more likely than you'd normally expect to arrive at SpinLockAcquire while another process is also acquiring the spinlock. This puts the two processes into a "lockstep" condition where the second process is nearly certain to observe the BufMgrLock as locked, and be forced to suspend itself, even though the time the first process holds the BufMgrLock is not really very long at all.

If you google for Xeon and "cache coherency" you'll find quite a bit of suggestive information about why this might be more true on the Xeon setup than others. A couple of interesting hits:

http://www.theinquirer.net/?article=10797
says that Xeon MP uses a *slower* FSB than Xeon DP. This would translate directly to more time needed to transfer a dirty cache line from one processor to the other, which is the basic operation that we're talking about here.

http://www.aceshardware.com/Spades/read.php?article_id=30000187
says that Opterons use a different cache coherency protocol that is fundamentally superior to the Xeon's, because dirty cache data can be transferred directly between two processor caches without waiting for main memory.

So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any evidence for or against that idea, but it's certainly possible.)

In the long run, however, CPUs continue to get faster than main memory and the price of cache contention will continue to rise. So it seems that we need to give up the assumption that SpinLockAcquire is a cheap operation. In the presence of heavy contention it won't be.

One thing we probably have got to do soon is break up the BufMgrLock into multiple finer-grain locks so that there will be less contention. However I am wary of doing this incautiously, because if we do it in a way that makes for a significant rise in the number of locks that have to be acquired to access a buffer, we might end up with a net loss.

I think Neil Conway was looking into how the bufmgr might be restructured to reduce lock contention, but if he had come up with anything he didn't mention exactly what. Neil?

			regards, tom lane
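A rough illustration of why those two instructions hurt so much under contention follows. This is generic x86 test-and-set code, not the PostgreSQL source: the point is that a bare xchg always takes the cache line in exclusive state, so every attempt by every waiter forces a cache-line transfer between processors, whereas a "test and test-and-set" loop lets waiters spin on a shared, read-only copy and only pay for the transfer when the lock actually looks free.

    /* Generic x86 spinlock sketch; not PostgreSQL source. */
    typedef volatile int slock_t;

    /* Bare test-and-set.  xchg with a memory operand is implicitly locked
     * and always acquires the cache line in exclusive/modified state, even
     * when the lock turns out to be held already, so N spinning waiters
     * bounce the line between CPUs on every iteration. */
    static inline int
    tas(slock_t *lock)
    {
        int     old = 1;

        __asm__ __volatile__("xchgl %0, %1"
                             : "+r" (old), "+m" (*lock)
                             :
                             : "memory");
        return old;
    }

    /* "Test and test-and-set": spin on a plain load first.  While the lock
     * is held, waiters only read, so the line can stay shared in every
     * CPU's cache; the expensive xchg happens only when the lock looks
     * free. */
    static inline void
    spin_acquire(slock_t *lock)
    {
        for (;;)
        {
            if (*lock == 0 && tas(lock) == 0)
                return;
            __asm__ __volatile__("rep; nop");   /* PAUSE: ease off the bus/HT */
        }
    }

    static inline void
    spin_release(slock_t *lock)
    {
        *lock = 0;                  /* a plain store releases on x86 */
    }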
So the kernel/OS is irrelevant here? This happens on any dual Xeon? What about hyperthreading: does it still happen if HTT is turned off?

Dave

On Sun, 2004-04-18 at 17:47, Tom Lane wrote:
> After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres.
> [rest of Tom's analysis, quoted in full above, trimmed]

-- 
Dave Cramer
519 939 0336
ICQ # 14675561
Tom Lane <tgl@sss.pgh.pa.us> writes:
> So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any evidence for or against that idea, but it's certainly possible.)
>
> In the long run, however, CPUs continue to get faster than main memory and the price of cache contention will continue to rise. So it seems that we need to give up the assumption that SpinLockAcquire is a cheap operation. In the presence of heavy contention it won't be.

There's nothing about the way Postgres spinlocks are coded that affects this? Is it something the kernel could help with? I've been wondering whether there are any benefits Postgres is missing out on by using its own hand-rolled locking instead of using the pthreads infrastructure that the kernel is often involved in.

-- 
greg
Dave Cramer <pg@fastcrypt.com> writes:
> So the kernel/OS is irrelevant here? This happens on any dual Xeon?

I believe so. The context-switch behavior might possibly be a little more pleasant on other kernels, but the underlying spinlock problem is not dependent on the kernel.

> What about hyperthreading: does it still happen if HTT is turned off?

The problem comes from keeping the caches synchronized between multiple physical CPUs. AFAICS enabling HTT wouldn't make it worse, because a hyperthreaded processor still only has one cache.

			regards, tom lane
Greg Stark <gsstark@mit.edu> writes:
> There's nothing about the way Postgres spinlocks are coded that affects this?

No. AFAICS our spinlock sequences are pretty much equivalent to the way the Linux kernel codes its spinlocks, so there's no deep dark knowledge to be mined there. We could possibly use some more-efficient blocking mechanism than semop() once we've decided we have to block (it's a shame Linux still doesn't have cross-process POSIX semaphores).

But the striking thing I learned from looking at the oprofile results is that most of the inefficiency comes at the very first TAS() operation, before we've even "spun", let alone decided we have to block. The s_lock() subroutine does not account for more than a few percent of the runtime in these tests, compared to 15% at the inline TAS() operations in LWLockAcquire and LWLockRelease. I interpret this to mean that once it's acquired ownership of the cache line, a Xeon can get through the "spinning" loop in s_lock() mighty quickly.

			regards, tom lane
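For readers who don't have the source in front of them, the structure Tom is describing looks roughly like the sketch below. This is heavily simplified and is not the real lwlock.c; it reuses the spin_acquire/spin_release sketch from a few messages up, and represents the sleep with sched_yield() where the real code blocks on a SysV semaphore. The point is that the LWLock's own bookkeeping is guarded by a spinlock, so the inline TAS runs on every acquire and release of the BufMgrLock; only when the LWLock is actually busy does the backend go to sleep, which is where the semop() calls and context switches come from.

    /* Heavily simplified sketch of the LWLock acquire/release pattern;
     * not the real lwlock.c. */
    #include <sched.h>

    typedef volatile int slock_t;

    extern void spin_acquire(slock_t *mutex);   /* the hot inline TAS (earlier sketch) */
    extern void spin_release(slock_t *mutex);

    typedef struct
    {
        slock_t     mutex;          /* spinlock protecting the fields below */
        int         exclusive;      /* is the LWLock held? */
    } LWLockSketch;

    void
    lwlock_acquire_sketch(LWLockSketch *lock)
    {
        for (;;)
        {
            spin_acquire(&lock->mutex);         /* <- where the 15% goes */
            if (!lock->exclusive)
            {
                lock->exclusive = 1;
                spin_release(&lock->mutex);
                return;
            }
            spin_release(&lock->mutex);

            /* Lock busy: the real code queues itself and blocks on a SysV
             * semaphore here (the semop() calls in the straces), which is
             * also where the context switches come from. */
            sched_yield();
        }
    }

    void
    lwlock_release_sketch(LWLockSketch *lock)
    {
        spin_acquire(&lock->mutex);             /* <- and here */
        lock->exclusive = 0;
        spin_release(&lock->mutex);
        /* (the real code also wakes any queued waiters at this point) */
    }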
>> What about hyperthreading: does it still happen if HTT is turned off?

> The problem comes from keeping the caches synchronized between multiple physical CPUs. AFAICS enabling HTT wouldn't make it worse, because a hyperthreaded processor still only has one cache.

Also, I forgot to say that the numbers I'm quoting *are* with HTT off.

			regards, tom lane
Josh,

I cannot reproduce the excessive semop() on a Dual XEON DP on a non-bigmem kernel, HT on. Interesting to know if the problem is related to XEON MP (as Tom wrote) or bigmem.

Josh Berkus wrote:
> Dirk,
>
>> I'm not sure if this semop() problem is still an issue, but the database behaves a bit out of bounds in this situation, i.e. consuming 95% of system resources with semop() calls while tables are locked very often and for longer.
>
> It would be helpful to us if you could test this with the indexes disabled on the non-Bigmem system. I'd like to eliminate Bigmem as a factor, if possible.
Here's an interesting link that suggests that hyperthreading would be much worse:

http://groups.google.com/groups?q=hyperthreading+dual+xeon+idle&start=10&hl=en&lr=&ie=UTF-8&c2coff=1&selm=aukkonen-FE5275.21093624062003%40shawnews.gv.shawcable.net&rnum=16

and another which has some hints as to how it should be handled:

http://groups.google.com/groups?q=hyperthreading+dual+xeon+idle&start=10&hl=en&lr=&ie=UTF-8&c2coff=1&selm=u5tl1XD3BHA.2760%40tkmsftngp04&rnum=19

FWIW, I have anecdotal evidence that suggests that this is the case: one of my clients was seeing very large context switches with HTT turned on, and without it things were much better.

Dave

On Sun, 2004-04-18 at 23:19, Tom Lane wrote:
> Also, I forgot to say that the numbers I'm quoting *are* with HTT off.

-- 
Dave Cramer
519 939 0336
ICQ # 14675561
What about quad-Xeon setups? Could that be worse? (We have both dual and quad setups.)

Shall we reconsider Xeon MP CPU machines with high cache (4MB+)? Very generally, what number would be considered high, especially if it coincides with expected heavy load?

Not sure a specific chipset was mentioned ...

Thanks,
Anjan

-----Original Message-----
From: Greg Stark [mailto:gsstark@mit.edu]
Sent: Sun 4/18/2004 8:40 PM
To: Tom Lane
Cc: lutzeb@aeccom.com; Josh Berkus; pgsql-performance@postgresql.org; Neil Conway
Subject: Re: [PERFORM] Wierd context-switching issue on Xeon

[Greg's message, quoted in full above, trimmed]
Tom,

> So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. (Josh thinks that the specific motherboard chipset being used in these machines might share some of the blame too. I don't have any evidence for or against that idea, but it's certainly possible.)

I have 3 reasons for thinking this:

1) The ServerWorks chipset is present in the fully documented cases that we have of this problem so far. This is notable because the SW is notorious for poor manufacturing quality, so much so that the company that made them is currently in receivership. These chips were so bad that Dell was forced to recall several hundred of its 2650s, where the motherboards caught fire!

2) The main defect of the SW is the NorthBridge, which could conceivably adversely affect traffic between RAM and the processor cache.

3) Xeon MP is a very popular platform thanks to Dell, yet we are not seeing more problem reports than we are; if the CPU alone were at fault, I'd expect far more.

The other thing I'd like your comment on, Tom, is that Dirk appears to have reported that when he installed a non-bigmem kernel, the issue went away. Dirk, is this correct?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> The other thing I'd like your comment on, Tom, is that Dirk appears to have reported that when he installed a non-bigmem kernel, the issue went away. Dirk, is this correct?

I'd be really surprised if that had anything to do with it. AFAIR Dirk's test changed more than one variable and so didn't prove a connection.

			regards, tom lane
I decided to check the context-switching behavior here for a baseline, since we have a rather diverse set of Postgres server hardware (though nothing using Xeon MP that is also running a Postgres instance), and everything looks normal under load. Some platforms are better than others, but nothing is outside of what I would consider normal bounds.

Our biggest database servers are Opteron SMP systems, and these servers are particularly well-behaved under load with Postgres 7.4.2. If there is a problem with the locking code and context-switching, it sure isn't manifesting on our Opteron SMP systems. Under rare confluences of process interaction, we occasionally see short spikes in the 2,000-3,000 cs/sec range. It typically peaks at a couple hundred cs/sec under load. Obviously this is going to be a function of our load profile to a certain extent.

The Opterons have proven to be very good database hardware in general for us.

j. andrew rogers
Josh Berkus wrote:
> Tom,
>
>> So in the short term I think we have to tell people that Xeon MP is not the most desirable SMP platform to run Postgres on. ...
>
> I have 3 reasons for thinking this:
> [Josh's three reasons, quoted in full above, trimmed]
>
> The other thing I'd like your comment on, Tom, is that Dirk appears to have reported that when he installed a non-bigmem kernel, the issue went away.

I have BSD on a SuperMicro dual Xeon, so if folks want another hardware/OS combination to test, I can give out logins to my machine.

	http://candle.pha.pa.us/main/hardware.html

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Mon, 19 Apr 2004, Bruce Momjian wrote:
> I have BSD on a SuperMicro dual Xeon, so if folks want another hardware/OS combination to test, I can give out logins to my machine.

I can probably do some nighttime testing on a dual 2800MHz non-MP Xeon machine as well. It's a Dell 2600 series machine and very fast. It has the moderately fast 533MHz FSB so may not have as many problems as the MP type CPUs seem to be having.
scott.marlowe wrote:
> On Mon, 19 Apr 2004, Bruce Momjian wrote:
>> I have BSD on a SuperMicro dual Xeon, so if folks want another hardware/OS combination to test, I can give out logins to my machine.
>
> I can probably do some nighttime testing on a dual 2800MHz non-MP Xeon machine as well. It's a Dell 2600 series machine and very fast. It has the moderately fast 533MHz FSB so may not have as many problems as the MP type CPUs seem to be having.

I've got a quad 2.8Ghz MP Xeon (IBM x445) that I could test on. Does anyone have a test set that can reliably reproduce the problem?

Joe
Joe,

> I've got a quad 2.8Ghz MP Xeon (IBM x445) that I could test on. Does anyone have a test set that can reliably reproduce the problem?

Unfortunately we can't seem to come up with one. So far we have 2 machines that exhibit the issue, and their databases are highly confidential (State of WA education data).

It does seem to require a database which is in the many GB (> 10GB), and a situation where a small subset of the data is getting hit repeatedly by multiple processes. So you could try your own data warehouse, making sure that you have at least 4 connections hitting one query after another.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Josh Berkus <josh@agliodbs.com> writes:
>> I've got a quad 2.8Ghz MP Xeon (IBM x445) that I could test on. Does anyone have a test set that can reliably reproduce the problem?

> Unfortunately we can't seem to come up with one.
> It does seem to require a database which is in the many GB (> 10GB), and a situation where a small subset of the data is getting hit repeatedly by multiple processes.

I do not think a large database is actually necessary; the test case Josh's client has is only hitting a relatively small amount of data. The trick seems to be to cause lots and lots of ReadBuffer/ReleaseBuffer activity without much else happening, and to do this from multiple backends concurrently. I believe the best way to make this happen is a lot of relatively simple (but not short) indexscan queries that in aggregate touch just a bit less than shared_buffers worth of data.

I have not tried to make a self-contained test case, but based on what I know now I think it should be possible. I'll give this a shot later tonight --- it does seem that trying to reproduce the problem on different kinds of hardware is the next useful step we can take.

			regards, tom lane
Here is a test case. To set up, run the "test_setup.sql" script once; then launch two copies of the "test_run.sql" script. (For those of you with more than two CPUs, see whether you need one per CPU to make trouble, or whether two test_runs are enough.) Check that you get a nestloops-with-index-scans plan shown by the EXPLAIN in test_run.

In isolation, test_run.sql should do essentially no syscalls at all once it's past the initial ramp-up. On a machine that's functioning per expectations, multiple copies of test_run show a relatively low rate of semop() calls --- a few per second, at most --- and maybe a delaying select() here and there.

What I actually see on Josh's client's machine is a context swap storm: "vmstat 1" shows CS rates around 170K/sec. strace'ing the backends shows a corresponding rate of semop() syscalls, with a few delaying select()s sprinkled in. top(1) shows system CPU percent of 25-30 and idle CPU percent of 16-20.

I haven't bothered to check how long the test_run query takes, but if it ends while you're still examining the behavior, just start it again.

Note the test case assumes you've got shared_buffers set to at least 1000; with smaller values, you may get some I/O syscalls, which will probably skew the results.

			regards, tom lane

-- test_setup.sql
drop table test_data;
create table test_data(f1 int);
insert into test_data values (random() * 100);
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
create index test_index on test_data(f1);
vacuum verbose analyze test_data;
checkpoint;

-- test_run.sql
-- force nestloop indexscan plan
set enable_seqscan to 0;
set enable_mergejoin to 0;
set enable_hashjoin to 0;

explain
select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;

select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;
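For anyone who would rather not juggle several psql windows, a throwaway libpq driver along the following lines can stand in for the multiple test_run.sql sessions. It is a sketch only: it assumes the test_setup.sql above has already been run in a database called "test", and it runs forever, so watch "vmstat 1" in another terminal and stop it with Ctrl-C.

    /* runstorm.c -- fork N clients that loop on the test_run query.
     * Sketch only; assumes test_setup.sql has been run in database "test".
     * Build with something like: cc runstorm.c -lpq */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <libpq-fe.h>

    static void
    run_client(void)
    {
        PGconn *conn = PQconnectdb("dbname=test");

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            _exit(1);
        }
        /* force the nestloop-with-indexscans plan, as in test_run.sql */
        PQclear(PQexec(conn, "set enable_seqscan to 0"));
        PQclear(PQexec(conn, "set enable_mergejoin to 0"));
        PQclear(PQexec(conn, "set enable_hashjoin to 0"));

        for (;;)
            PQclear(PQexec(conn,
                           "select count(*) from test_data a, test_data b, "
                           "test_data c where a.f1 = b.f1 and b.f1 = c.f1"));
    }

    int
    main(int argc, char **argv)
    {
        int     nclients = (argc > 1) ? atoi(argv[1]) : 2;
        int     i;

        for (i = 0; i < nclients; i++)
            if (fork() == 0)
                run_client();       /* child: never returns */

        for (i = 0; i < nclients; i++)
            wait(NULL);             /* parent: wait (forever) for children */
        return 0;
    }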
I wrote:
> Here is a test case.

Hmmm ... I've been able to reproduce the CS storm on a dual Athlon, which seems to pretty much let the Xeon per se off the hook. Anybody got a multiple Opteron to try? Totally non-Intel CPUs?

It would be interesting to see results with non-Linux kernels, too.

			regards, tom lane
Tom Lane wrote:
> Here is a test case. To set up, run the "test_setup.sql" script once; then launch two copies of the "test_run.sql" script. (For those of you with more than two CPUs, see whether you need one per CPU to make trouble, or whether two test_runs are enough.) Check that you get a nestloops-with-index-scans plan shown by the EXPLAIN in test_run.

Check.

> In isolation, test_run.sql should do essentially no syscalls at all once it's past the initial ramp-up. On a machine that's functioning per expectations, multiple copies of test_run show a relatively low rate of semop() calls --- a few per second, at most --- and maybe a delaying select() here and there.
>
> What I actually see on Josh's client's machine is a context swap storm: "vmstat 1" shows CS rates around 170K/sec. strace'ing the backends shows a corresponding rate of semop() syscalls, with a few delaying select()s sprinkled in. top(1) shows system CPU percent of 25-30 and idle CPU percent of 16-20.

Your test case works perfectly. I ran 4 concurrent psql sessions, on a quad Xeon (IBM x445, 2.8GHz, 4GB RAM), hyperthreaded. Here's what 'top' looks like:

177 processes: 173 sleeping, 3 running, 1 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   35.9%    0.0%    7.2%   0.0%     0.0%    0.0%   56.8%
           cpu00   19.6%    0.0%    4.9%   0.0%     0.0%    0.0%   75.4%
           cpu01   44.1%    0.0%    7.8%   0.0%     0.0%    0.0%   48.0%
           cpu02    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu03   32.3%    0.0%   13.7%   0.0%     0.0%    0.0%   53.9%
           cpu04   21.5%    0.0%   10.7%   0.0%     0.0%    0.0%   67.6%
           cpu05   42.1%    0.0%    9.8%   0.0%     0.0%    0.0%   48.0%
           cpu06  100.0%    0.0%    0.0%   0.0%     0.0%    0.0%    0.0%
           cpu07   27.4%    0.0%   10.7%   0.0%     0.0%    0.0%   61.7%
Mem:  4123700k av, 3933896k used,  189804k free,    0k shrd,  221948k buff
                   2492124k actv,  760612k in_d,  41416k in_c
Swap: 2040244k av,    5632k used, 2034612k free              3113272k cached

Note that cpu06 is not a postgres process. The output of vmstat looks like this:

# vmstat 1
procs                      memory      swap          io     system        cpu
 r  b   swpd   free   buff   cache   si   so    bi    bo   in     cs us sy id wa
 4  0   5632 184264 221948 3113308    0    0     0     0    0      0  0  0  0  0
 3  0   5632 184264 221948 3113308    0    0     0     0  112 211894 36  9 55  0
 5  0   5632 184264 221948 3113308    0    0     0     0  125 222071 39  8 53  0
 4  0   5632 184264 221948 3113308    0    0     0     0  110 215097 39 10 52  0
 1  0   5632 184588 221948 3113308    0    0     0    96  139 187561 35 10 55  0
 3  0   5632 184588 221948 3113308    0    0     0     0  114 241731 38 10 52  0
 3  0   5632 184920 221948 3113308    0    0     0     0  132 257168 40  9 51  0
 1  0   5632 184912 221948 3113308    0    0     0     0  114 251802 38  9 54  0

> Note the test case assumes you've got shared_buffers set to at least 1000; with smaller values, you may get some I/O syscalls, which will probably skew the results.

 shared_buffers
----------------
 16384
(1 row)

I found that killing three of the four concurrent queries dropped context switches to about 70,000 to 100,000. Two or more sessions brings it up to 200K+.

Joe
When grilled further on (Mon, 19 Apr 2004 20:53:09 -0400), Tom Lane <tgl@sss.pgh.pa.us> confessed:
> I wrote:
>> Here is a test case.
>
> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon, which seems to pretty much let the Xeon per se off the hook. Anybody got a multiple Opteron to try? Totally non-Intel CPUs?
>
> It would be interesting to see results with non-Linux kernels, too.

Same problem on my dual AMD MP with a 2.6.5 kernel using two sessions of your test, but maybe not quite as severe. The highest CS value I saw was 102k, with some non-DB number crunching going on in parallel with the test. 'Average' was about 80k with two instances. Using the anticipatory scheduler.

A single instance pulls in around 200-300 CS, and with no tests running around 200-300 CS (i.e. no CS difference).

A snippet:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 3  0    284  90624  93452 1453740    0    0     0     0 1075 76548 83 17  0  0
 6  0    284 125312  93452 1470196    0    0     0     0 1073 87702 78 22  0  0
 3  0    284 178392  93460 1420208    0    0    76   298 1083 67721 77 24  0  0
 4  0    284 177120  93460 1421500    0    0  1104     0 1054 89593 80 21  0  0
 5  0    284 173504  93460 1425172    0    0  3584     0 1110 65536 81 19  0  0
 4  0    284 169984  93460 1428708    0    0  3456     0 1098 66937 81 20  0  0
 6  0    284 170944  93460 1428708    0    0     8     0 1045 66065 81 19  0  0
 6  0    284 167288  93460 1428776    0    0     0     8 1097 75560 81 19  0  0
 6  0    284 136296  93460 1458356    0    0     0     0 1036 80808 75 26  0  0
 5  0    284 132864  93460 1461688    0    0     0     0 1007 76071 84 17  0  0
 4  0    284 132880  93460 1461688    0    0     0     0 1079 86903 82 18  0  0
 5  0    284 132880  93460 1461688    0    0     0     0 1078 79885 83 17  0  0
 6  0    284 132648  93460 1461688    0    0     0   760 1228 66564 86 14  0  0
 6  0    284 132648  93460 1461688    0    0     0     0 1047 69741 86 15  0  0
 6  0    284 132672  93460 1461688    0    0     0     0 1057 79052 84 16  0  0
 5  0    284 132672  93460 1461688    0    0     0     0 1054 81109 82 18  0  0
 5  0    284 132736  93460 1461688    0    0     0     0 1043 91725 80 20  0  0

Cheers,
Rob

-- 
 21:33:03 up 3 days, 1:10, 3 users, load average: 5.05, 4.67, 4.22
Linux 2.6.5-01 #5 SMP Tue Apr 6 21:32:39 MDT 2004
Same problem with dual 1Ghz P3's running Postgres 7.4.2, linux 2.4.x, and 2GB ram, under load, with long transactions (i.e. 1 "cannot serialize" rollback per minute). 200K was the worst observed with vmstat. Finally moved DB to a single xeon box.
Hi Tom,

You still have an account on my UnixWare bi-Xeon hyperthreaded machine. Feel free to use it for your tests.

On Mon, 19 Apr 2004, Tom Lane wrote:
> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon, which seems to pretty much let the Xeon per se off the hook. Anybody got a multiple Opteron to try? Totally non-Intel CPUs?
>
> It would be interesting to see results with non-Linux kernels, too.

-- 
Olivier PRENANT                      Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou                +33-5-61-50-97-01 (Fax)
31190 AUTERIVE                            +33-6-07-63-80-64 (GSM)
FRANCE                             Email: ohp@pyrenet.fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)
On Apr 19, 2004, at 8:01 PM, Tom Lane wrote:
[test case]

Quad P3-700Mhz, ServerWorks, pg 7.4.2:
  1 process:  10-30 cs / second
  2 process:  100k cs / sec
  3 process:  140k cs / sec
  8 process:  115k cs / sec

Dual P2-450Mhz, non-ServerWorks (PIIX):
  1 process:  15-20 / sec
  2 process:  30k / sec
  3 (up to 7) process:  15k / sec

(Yes, I verified that with more processes the CS rate drops.)

And finally, a 6-cpu Sun E4500, Solaris 2.6, pg 7.4.2:
  1 - 10 processes: hovered between 2-3k cs/second (there was other stuff running on the machine as well)

Verrry interesting. I've got a dual G4 at home, but, conveniently, Apple doesn't ship a vmstat that tells context switches.

-- 
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
Dual Athlon:
  with one process running:   30 cs/second
  with two processes running: 15,000 cs/second

Dave

On Tue, 2004-04-20 at 08:46, Jeff wrote:
> [Jeff's per-machine results, quoted in full above, trimmed]

-- 
Dave Cramer
519 939 0336
ICQ # 14675561
As a cross-ref to all the 7.4.x tests people have sent in, here's 7.2.3 (Red Hat 7.3), quad Xeon 700MHz/1MB L2 cache, 3GB RAM.

Idle-ish (it's a production server): cs/sec ~5000

3 test queries running:

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff   cache  si  so    bi    bo   in     cs  us  sy  id
 3  0  0  23380 577680 105912 2145140   0   0     0     0  107 116890  50  14  35
 2  0  0  23380 577680 105912 2145140   0   0     0     0  114 118583  50  15  34
 2  0  0  23380 577680 105912 2145140   0   0     0     0  107 115842  54  14  32
 2  1  0  23380 577680 105920 2145140   0   0     0    32  156 117549  50  16  35

HTH

Matt

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: 20 April 2004 01:02
> Subject: Re: [PERFORM] Wierd context-switching issue on Xeon
>
> [Tom's test-case instructions, quoted in full above, trimmed]
Hi Tom,

Just to explain our hardware situation related to the FSB of the XEONs: we have older XEON DP systems in operation with a 400MHz FSB at 2.4 GHz. The XEON MP box runs at 2.5 GHz; it is a Fujitsu-Siemens Primergy RX600 with the ServerWorks GC LE chipset. The box which Dirk used to compare the behavior is our newest XEON DP system. This XEON DP box runs at 2.8 GHz with a 533MHz FSB, using the Intel 7501 chipset (Supermicro).

I would agree with Josh: if PostgreSQL has an issue with the Intel XEON MP hardware, it is more related to the chipset.

Back to the SQL level: we use SELECT FOR UPDATE as a "semaphore". Should we try another implementation for this semaphore on the client side to prevent this issue?

Regards,

Sven.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: <lutzeb@aeccom.com>
Cc: "Josh Berkus" <josh@agliodbs.com>; <pgsql-performance@postgreSQL.org>; "Neil Conway" <neilc@samurai.com>
Sent: Sunday, April 18, 2004 11:47 PM
Subject: Re: [PERFORM] Wierd context-switching issue on Xeon

> After some further digging I think I'm starting to understand what's up here, and the really fundamental answer is that a multi-CPU Xeon MP box sucks for running Postgres.
> [rest of Tom's analysis, quoted in full above, trimmed]
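For readers unfamiliar with the pattern Sven mentions, the SELECT FOR UPDATE "semaphore" looks roughly like this from a C client. This is a sketch only: the connection string and the sem_slot table are made up. The row lock is held until COMMIT, so concurrent holders queue up in the lock manager, and heavyweight lock waits in PostgreSQL are semaphore sleeps, which is one way an application can generate a lot of semop() traffic on its own.

    /* Sketch of the SELECT FOR UPDATE "semaphore" pattern via libpq.
     * The connection string and the sem_slot table are hypothetical. */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("dbname=test");
        PGresult   *res;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        PQclear(PQexec(conn, "BEGIN"));

        /* Blocks until whoever currently holds the row lock commits or
         * aborts; on SysV-semaphore builds the wait is a semop() sleep. */
        res = PQexec(conn, "SELECT id FROM sem_slot WHERE id = 1 FOR UPDATE");
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "lock failed: %s", PQerrorMessage(conn));
        PQclear(res);

        /* ... do the work that must be serialized ... */

        PQclear(PQexec(conn, "COMMIT"));    /* releases the "semaphore" */
        PQfinish(conn);
        return 0;
    }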
There are a few things that you can do to help force yourself to be I/O bound. These include:

- RAID 5 for write intensive applications, since multiple writes per synch write is good. (There is a special case for logging or other streaming sequential writes on RAID 5.)

- Data journaling file systems are helpful in stress testing your checkpoints.

- Using midsized battery backed up write-through buffering controllers. In general, if you have a small cache, you see the problem directly, and a huge cache will balance out load and defer writes to quieter times. That is why a midsized cache is so useful in showing stress in your system only when it is being stressed.

Only partly in jest,
/Aaron

BTW - I am truly curious about what happens to your system if you use separate RAID 0+1 for your logs, disk sorts, and at least the most active tables. This should reduce I/O load by an order of magnitude.

"Vivek Khera" <khera@kcilink.com> wrote in message news:x7smez7tqj.fsf@yertle.int.kciLink.com...
>>>>>> "JB" == Josh Berkus <josh@agliodbs.com> writes:
>
> JB> Aaron,
>>> I do consulting, so they're all over the place and tend to be complex. Very few fit in RAM, but still are very buffered. These are almost all backed with very high end I/O subsystems, with dozens of spindles with battery backed up writethrough cache and gigs of buffers, which may be why I worry so much about CPU. I have had this issue with multiple servers.
>
> JB> Aha, I think this is the difference. I never seem to be able to get my clients to fork out for adequate disk support. They are always running off single or double SCSI RAID in the host server; not the sort of setup you have.
>
> Even when I upgraded my system to a 14-spindle RAID5 with 128M cache and 4GB RAM on a dual Xeon system, I still wind up being I/O bound quite often.
>
> I think it depends on what your "working set" turns out to be. My workload really spans a lot more of the DB than I can end up caching.
>
> --
> Vivek Khera, Ph.D.                Khera Communications, Inc.
> Internet: khera@kciLink.com       Rockville, MD  +1-301-869-4449 x806
> AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/
I would agree with Tom that too many parameters are involved to blame bigmem. I have access to the following machines where the same application operates:

a) Dual (4-way) XEON MP, bigmem, HT off, ServerWorks chipset (a Fujitsu-Siemens Primergy): performs ok now because the missing indexes were added, but this is no proof that the behaviour won't occur again under high load; context switches are moderate but have peaks to 40,000.

b) Dual XEON DP, non-bigmem, HT on, ServerWorks chipset (a Dell machine I think): performs moderately because I see too many context switches here although the mentioned indexes are created; context switches go up to 30,000 often, and I can see 50% semop calls.

c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro): performs well and I could not observe context switch peaks here (one user active), almost no extra semop calls.

d) Dual XEON DP, bigmem, HT off, ServerWorks chipset (a Fujitsu-Siemens Primergy): performance unknown at the moment (it is offline), but it looked like a) in the past.

I can offer to do tests on those machines if somebody would provide me some test instructions to nail this problem down.

Dirk

Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> The other thing I'd like your comment on, Tom, is that Dirk appears to have reported that when he installed a non-bigmem kernel, the issue went away. Dirk, is this correct?
>
> I'd be really surprised if that had anything to do with it. AFAIR Dirk's test changed more than one variable and so didn't prove a connection.
Dirk Lutzebaeck wrote:
> c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro)
>
> performs well and I could not observe context switch peaks here (one user active), almost no extra semop calls

Did Tom's test here: with 2 processes I reach 200k+ CS with peaks to 300k CS. Bummer ... Josh, I don't think you can bash the ServerWorks chipset here, nor bigmem.

Dirk
I tried to test how this is related to cache coherency, by forcing affinity of the two test_run.sql processes to the two cores (pipelines? threads) of a single hyperthreaded Xeon processor in an SMP Xeon box.

When the processes are allowed to run on distinct chips in the SMP box, the CS storm happens. When they are "bound" to the two cores of a single hyperthreaded Xeon in the SMP box, the CS storm *does* happen. [see the correction in the follow-up below]

I used the taskset command:

  taskset 01 -p <pid for backend of test_run.sql 1>
  taskset 01 -p <pid for backend of test_run.sql 1>

I guess that 0 and 1 are the two cores (pipelines? hyper-threads?) on the first Xeon processor in the box.

I did this on Red Hat Fedora Core 1 on an Intel motherboard (I'll get the part no if it matters).

During storms: 300k CS/sec, 75% idle on a dual Xeon (four core) machine (suggesting serializing/sleeping processes).
No storm: 50k CS/sec, 50% idle (suggesting 2 CPU-bound processes).

Maybe there's a "hot block" that is bouncing back and forth between caches? Or maybe the page holding semaphores?

On Apr 19, 2004, at 5:53 PM, Tom Lane wrote:
> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon, which seems to pretty much let the Xeon per se off the hook. Anybody got a multiple Opteron to try? Totally non-Intel CPUs?
>
> It would be interesting to see results with non-Linux kernels, too.
> It would be interesting to see results with non-Linux kernels, too.

Dual Celeron 500MHz (Abit BP6 mobo), client & server on the same machine:

  2 processes FreeBSD (5.2.1):        1800 cs
  3 processes FreeBSD:               14000 cs
  4 processes FreeBSD:               14500 cs

  2 processes Linux (2.4.18 kernel): 52000 cs
  3 processes Linux:                 10000 cs
  4 processes Linux:                 20000 cs
Oops, what I meant to say was that 2 threads bound to one (hyperthreaded) CPU does *NOT* cause the storm, even on an SMP Xeon.

Therefore, the context switches may be a result of cache-coherency-related delays. (2 threads on one hyperthreaded CPU presumably have tightly coupled L1/L2 caches.)

On Apr 20, 2004, at 1:02 PM, Paul Tuckfield wrote:
> I tried to test how this is related to cache coherency, by forcing affinity of the two test_run.sql processes to the two cores (pipelines? threads) of a single hyperthreaded Xeon processor in an SMP Xeon box.
>
> When the processes are allowed to run on distinct chips in the SMP box, the CS storm happens. When they are "bound" to the two cores of a single hyperthreaded Xeon in the SMP box, the CS storm *does* happen.
  [er, meant *does NOT happen*]
>
> [rest of the message, quoted in full above, trimmed]
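For reference, the pinning Paul did with taskset can also be requested from inside a process. Below is a minimal Linux sketch, assuming a glibc that exposes the three-argument sched_setaffinity() and the CPU_* macros; note that whether logical CPUs 0 and 1 are the two hyperthreads of the same physical package depends on how the kernel enumerates them, so check /proc/cpuinfo first.

    /* Bind the calling process to logical CPUs 0 and 1, roughly what
     * `taskset 03 <cmd>` does from the shell.  Linux-specific sketch;
     * assumes the three-argument glibc sched_setaffinity(). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        cpu_set_t   mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(1, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        {
            perror("sched_setaffinity");
            return 1;
        }

        /* exec or fork the real workload here; children inherit the mask */
        printf("pid %d bound to logical CPUs 0 and 1\n", (int) getpid());
        return 0;
    }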
Dirk, Tom,

OK, off IRC, I have the following reports:

  Linux 2.4.21 or 2.4.20 on dual Pentium III:          problem verified
  Linux 2.4.21 or 2.4.20 on dual Pentium II:           problem cannot be reproduced
  Solaris 2.6 on 6-cpu E4500 (using 8 processes):      problem not reproduced

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco
I verified the problem on a dual Opteron server. I temporarily killed the normal load, so the server was largely idle when the test was run.

Hardware:
  2x Opteron 242
  Rioworks HDAMA server board
  4GB RAM

OS Kernel: RedHat 9 + XFS

  1 proc:  10-15 cs/sec
  2 proc:  400,000-420,000 cs/sec

j. andrew rogers
Anjan, > Quad 2.0GHz XEON with highest load we have seen on the applications, DB > performing great - Can you run Tom's test? It takes a particular pattern of data access to reproduce the issue. -- Josh Berkus Aglio Database Solutions San Francisco
Dirk Lutzebäck wrote: > Dirk Lutzebaeck wrote: > > > c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro) > > > > performs well and I could not observe context switch peaks here (one > > user active), almost no extra semop calls > > Did Tom's test here: with 2 processes I'll reach 200k+ CS with peaks to > 300k CS. Bummer.. Josh, I don't think you can bash the ServerWorks > chipset here nor bigmem. Dave Cramer reproduced the problem on my SuperMicro dual Xeon on BSD/OS. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
If this helps - Quad 2.0GHz XEON with highest load we have seen on the applications, DB performing great - procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 1 0 0 1616 351820 66144 10813704 0 0 2 0 1 1 0 2 7 3 0 0 1616 349712 66144 10813736 0 0 8 1634 1362 4650 4 2 95 0 0 0 1616 347768 66144 10814120 0 0 188 1218 1158 4203 5 1 93 0 0 1 1616 346596 66164 10814184 0 0 8 1972 1394 4773 4 1 94 2 0 1 1616 345424 66164 10814272 0 0 20 1392 1184 4197 4 2 94 Around 4k CS/sec Chipset is Intel ServerWorks GC-HE. Linux Kernel 2.4.20-28.9bigmem #1 SMP Thanks, Anjan -----Original Message----- From: Dirk Lutzebäck [mailto:lutzeb@aeccom.com] Sent: Tuesday, April 20, 2004 10:29 AM To: Tom Lane; Josh Berkus Cc: pgsql-performance@postgreSQL.org; Neil Conway Subject: Re: [PERFORM] Wierd context-switching issue on Xeon Dirk Lutzebaeck wrote: > c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro) > > performs well and I could not observe context switch peaks here (one > user active), almost no extra semop calls Did Tom's test here: with 2 processes I'll reach 200k+ CS with peaks to 300k CS. Bummer.. Josh, I don't think you can bash the ServerWorks chipset here nor bigmem. Dirk ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org
I modified the code in s_lock.c to remove the spins #define SPINS_PER_DELAY 1 and it doesn't exhibit the behaviour. This effectively changes the code to while(TAS(lock)) select(10000); // 10ms Can anyone explain why executing TAS 100 times would increase context switches? Dave On Tue, 2004-04-20 at 12:59, Josh Berkus wrote: > Anjan, > > > Quad 2.0GHz XEON with highest load we have seen on the applications, DB > > performing great - > > Can you run Tom's test? It takes a particular pattern of data access to > reproduce the issue. -- Dave Cramer 519 939 0336 ICQ # 14675561
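For readers following the mechanics: below is a minimal C sketch of the spin-then-sleep control flow under discussion, with a modern GCC builtin standing in for the real TAS() macro (the actual code uses platform-specific inline assembly); the names and constants are illustrative, not the actual s_lock.c code. Setting SPINS_PER_DELAY to 1, as in Dave's experiment, just makes the loop fall through to the timed sleep after a single failed test-and-set.

    #include <sys/select.h>

    #define SPINS_PER_DELAY 100     /* the constant being tuned in this thread */

    /* stand-in for the real TAS() macro: returns nonzero if the lock was
     * already held when we tried to grab it */
    static int
    tas_stub(volatile int *lock)
    {
        return __sync_lock_test_and_set(lock, 1) != 0;
    }

    static void
    acquire_spinlock(volatile int *lock)
    {
        int spins = 0;

        while (tas_stub(lock))
        {
            if (++spins >= SPINS_PER_DELAY)
            {
                struct timeval delay = { 0, 10000 };    /* ~10 ms back-off */

                select(0, NULL, NULL, NULL, &delay);    /* the select() seen in strace */
                spins = 0;
            }
        }
    }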
Joe Conway wrote: >> In isolation, test_run.sql should do essentially no syscalls at all once >> it's past the initial ramp-up. On a machine that's functioning per >> expectations, multiple copies of test_run show a relatively low rate of >> semop() calls --- a few per second, at most --- and maybe a delaying >> select() here and there. Here's results for 7.4 on a dual Athlon server running fedora core: CPU states: cpu user nice system irq softirq iowait idle total 86.0% 0.0% 52.4% 0.0% 0.0% 0.0% 61.2% cpu00 37.6% 0.0% 29.7% 0.0% 0.0% 0.0% 32.6% cpu01 48.5% 0.0% 22.7% 0.0% 0.0% 0.0% 28.7% procs memory swap io system cpu r b swpd free buff cache si so bi bo in cs 1 0 120448 25764 48300 1094576 0 0 0 124 170 187 1 0 120448 25780 48300 1094576 0 0 0 0 152 89 2 0 120448 25744 48300 1094580 0 0 0 60 141 78290 2 0 120448 25752 48300 1094580 0 0 0 0 131 140326 2 0 120448 25756 48300 1094576 0 0 0 40 122 140100 2 0 120448 25764 48300 1094584 0 0 0 60 133 136595 2 0 120448 24284 48300 1094584 0 0 0 200 138 135151 The jump in cs corresponds to starting the query in the second session. Joe
Hi, Dual Xeon P4 2.8 linux RedHat AS 3 kernel 2.4.21-4-EL-smp 2 GB ram I can see the same problem: procs memory swap io system cpu r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 96212 61056 1720240 0 0 0 0 101 11 25 0 75 0 1 0 0 96212 61056 1720240 0 0 0 0 108 139 25 0 75 0 1 0 0 96212 61056 1720240 0 0 0 0 104 173 25 0 75 0 1 0 0 96212 61056 1720240 0 0 0 0 102 11 25 0 75 0 1 0 0 96212 61056 1720240 0 0 0 0 101 11 25 0 75 0 2 0 0 96204 61056 1720240 0 0 0 0 110 53866 31 4 65 0 2 0 0 96204 61056 1720240 0 0 0 0 101 83176 41 5 54 0 2 0 0 96204 61056 1720240 0 0 0 0 102 86050 39 6 55 0 2 0 0 96204 61056 1720240 0 0 0 49 113 73642 41 5 54 0 2 0 0 96204 61056 1720240 0 0 0 0 102 84211 40 5 55 0 2 0 0 96204 61056 1720240 0 0 0 0 101 105165 39 7 54 0 2 0 0 96204 61056 1720240 0 0 0 0 103 97754 38 6 56 0 2 0 0 96204 61056 1720240 0 0 0 0 103 113668 36 7 57 0 2 0 0 96204 61056 1720240 0 0 0 0 103 112003 37 7 56 0 regards, ivan.
How long is this test supposed to run? I've launched just 1 for testing, the plan seems horrible; the test is cpu bound and hasn't finished yet after 17:02 min of CPU time, dual XEON 2.6G Unixware 713 The machine is a Fujitsu-Siemens TX 200 server On Mon, 19 Apr 2004, Tom Lane wrote: > Date: Mon, 19 Apr 2004 20:01:56 -0400 > From: Tom Lane <tgl@sss.pgh.pa.us> > To: josh@agliodbs.com > Cc: Joe Conway <mail@joeconway.com>, scott.marlowe <scott.marlowe@ihs.com>, > Bruce Momjian <pgman@candle.pha.pa.us>, lutzeb@aeccom.com, > pgsql-performance@postgresql.org, Neil Conway <neilc@samurai.com> > Subject: Re: [PERFORM] Wierd context-switching issue on Xeon > > Here is a test case. To set up, run the "test_setup.sql" script once; > then launch two copies of the "test_run.sql" script. (For those of > you with more than two CPUs, see whether you need one per CPU to make > trouble, or whether two test_runs are enough.) Check that you get a > nestloops-with-index-scans plan shown by the EXPLAIN in test_run. > > In isolation, test_run.sql should do essentially no syscalls at all once > it's past the initial ramp-up. On a machine that's functioning per > expectations, multiple copies of test_run show a relatively low rate of > semop() calls --- a few per second, at most --- and maybe a delaying > select() here and there. > > What I actually see on Josh's client's machine is a context swap storm: > "vmstat 1" shows CS rates around 170K/sec. strace'ing the backends > shows a corresponding rate of semop() syscalls, with a few delaying > select()s sprinkled in. top(1) shows system CPU percent of 25-30 > and idle CPU percent of 16-20. > > I haven't bothered to check how long the test_run query takes, but if it > ends while you're still examining the behavior, just start it again. > > Note the test case assumes you've got shared_buffers set to at least > 1000; with smaller values, you may get some I/O syscalls, which will > probably skew the results. > > regards, tom lane > > -- Olivier PRENANT Tel: +33-5-61-50-97-00 (Work) 6, Chemin d'Harraud Turrou +33-5-61-50-97-01 (Fax) 31190 AUTERIVE +33-6-07-63-80-64 (GSM) FRANCE Email: ohp@pyrenet.fr ------------------------------------------------------------------------------ Make your life a dream, make your dream a reality. (St Exupery)
It is intended to run indefinitely. Dirk ohp@pyrenet.fr wrote: >How long is this test supposed to run? > >I've launched just 1 for testing, the plan seems horrible; the test is cpu >bound and hasn't finished yet after 17:02 min of CPU time, dual XEON 2.6G >Unixware 713 > >The machine is a Fujitsu-Siemens TX 200 server > >
After some testing: if you use the current head code for s_lock.c, which has some mods in it to alleviate this situation, and change SPINS_PER_DELAY to 10, you can drastically reduce the cs with Tom's test. I am seeing a slight degradation in throughput using pgbench -c 10 -t 1000, but it might be liveable, considering the alternative is unbearable in some situations. Can anyone else replicate my results? Dave On Wed, 2004-04-21 at 08:10, Dirk_Lutzebäck wrote: > It is intended to run indefinitely. > > Dirk > > ohp@pyrenet.fr wrote: > > >How long is this test supposed to run? > > > >I've launched just 1 for testing, the plan seems horrible; the test is cpu > >bound and hasn't finished yet after 17:02 min of CPU time, dual XEON 2.6G > >Unixware 713 > > > >The machine is a Fujitsu-Siemens TX 200 server > > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > > -- Dave Cramer 519 939 0336 ICQ # 14675561
Dave, > After some testing if you use the current head code for s_lock.c which > has some mods in it to alleviate this situation, and change > SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test. > I am seeing a slight degradation in throughput using pgbench -c 10 -t > 1000 but it might be liveable, considering the alternative is unbearable > in some situations. > > Can anyone else replicate my results? Can you produce a patch against 7.4.1? I'd like to test your fix against a real-world database. -- Josh Berkus Aglio Database Solutions San Francisco
Dave: Why would test and set increase context switches: Note that it *does not increase* context switches when the two threads are on the two cores of a single Xeon processor. (use taskset to force affinity on linux) Scenario: If the two test and set processes are testing and setting the same bit as each other, then they'll see worst case cache coherency misses. They'll ping a cache line back and forth between CPUs. Another case might be that they're testing and setting different bits or words, but those bits or words are always in the same cache line, again causing worst case cache coherency and misses. The fact that this doesn't happen when the threads are bound to the 2 cores of a single Xeon suggests it's because they're now sharing L1 cache. No pings/bounces. I wonder do the threads stall so badly when pinging cache lines back and forth, that the kernel sees it as an opportunity to put the process to sleep? or do these worst case misses cause an interrupt? My question is: What is it that the two threads are waiting for when they spin? Is it exactly the same resource, or two resources that happen to have test-and-set flags in the same cache line? On Apr 20, 2004, at 7:41 PM, Dave Cramer wrote: > I modified the code in s_lock.c to remove the spins > > #define SPINS_PER_DELAY 1 > > and it doesn't exhibit the behaviour > > This effectively changes the code to > > > while(TAS(lock)) > select(10000); // 10ms > > Can anyone explain why executing TAS 100 times would increase context > switches ? > > Dave > > > On Tue, 2004-04-20 at 12:59, Josh Berkus wrote: >> Anjan, >> >>> Quad 2.0GHz XEON with highest load we have seen on the applications, >>> DB >>> performing great - >> >> Can you run Tom's test? It takes a particular pattern of data >> access to >> reproduce the issue. > -- > Dave Cramer > 519 939 0336 > ICQ # 14675561 > > > ---------------------------(end of > broadcast)--------------------------- > TIP 8: explain analyze is your friend >
Paul Tuckfield <paul@tuckfield.com> writes: > I wonder do the threads stall so badly when pinging cache lines back > and forth, that the kernel sees it as an opportunity to put the > process to sleep? or do these worst case misses cause an interrupt? No; AFAICS the kernel could not even be aware of that behavior. The context swap storm is happening because of contention at the next level up (LWLocks rather than spinlocks). It could be an independent issue that just happens to be triggered by the same sort of access pattern. I put forward a hypothesis that the cache miss storm caused by the test-and-set ops induces the context swap storm by making the code more likely to be executing in certain places at certain times ... but it's only a hypothesis. Yesterday evening I had pretty well convinced myself that they were indeed independent issues: profiling on a single-CPU machine was telling me that the test case I proposed spends over 10% of its time inside ReadBuffer, which certainly seems like enough to explain a high rate of contention on the BufMgrLock, without any assumptions about funny behavior at the hardware level. However, your report and Dave's suggest that there really is some linkage. So I'm still confused. regards, tom lane
FYI, I am doing my testing on non hyperthreading dual athlons. Also, the test and set is attempting to set the same resource, and not simply a bit. It's really an lock;xchg in assemblelr. Also we are using the PAUSE mnemonic, so we should not be seeing any cache coherency issues, as the cache is being taken out of the picture AFAICS ? Dave On Wed, 2004-04-21 at 14:19, Paul Tuckfield wrote: > Dave: > > Why would test and set increase context swtches: > Note that it *does not increase* context swtiches when the two threads > are on the two cores of a single Xeon processor. (use taskset to force > affinity on linux) > > Scenario: > If the two test and set processes are testing and setting the same bit > as each other, then they'll see worst case cache coherency misses. > They'll ping a cache line back and forth between CPUs. Another case, > might be that they're tesing and setting different bits or words, but > those bits or words are always in the same cache line, again causing > worst case cache coherency and misses. The fact that tis doesn't > happen when the threads are bound to the 2 cores of a single Xeon > suggests it's because they're now sharing L1 cache. No pings/bounces. > > > I wonder do the threads stall so badly when pinging cache lines back > and forth, that the kernel sees it as an opportunity to put the > process to sleep? or do these worst case misses cause an interrupt? > > My question is: What is it that the two threads waiting for when they > spin? Is it exactly the same resource, or two resources that happen to > have test-and-set flags in the same cache line? > > On Apr 20, 2004, at 7:41 PM, Dave Cramer wrote: > > > I modified the code in s_lock.c to remove the spins > > > > #define SPINS_PER_DELAY 1 > > > > and it doesn't exhibit the behaviour > > > > This effectively changes the code to > > > > > > while(TAS(lock)) > > select(10000); // 10ms > > > > Can anyone explain why executing TAS 100 times would increase context > > switches ? > > > > Dave > > > > > > On Tue, 2004-04-20 at 12:59, Josh Berkus wrote: > >> Anjan, > >> > >>> Quad 2.0GHz XEON with highest load we have seen on the applications, > >>> DB > >>> performing great - > >> > >> Can you run Tom's test? It takes a particular pattern of data > >> access to > >> reproduce the issue. > > -- > > Dave Cramer > > 519 939 0336 > > ICQ # 14675561 > > > > > > ---------------------------(end of > > broadcast)--------------------------- > > TIP 8: explain analyze is your friend > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match > > > > !DSPAM:4086c4d0263544680737483! > > -- Dave Cramer 519 939 0336 ICQ # 14675561
attached. -- Dave Cramer 519 939 0336 ICQ # 14675561
Kenneth Marshall <ktm@is.rice.edu> writes: > If the context swap storm derives from LWLock contention, maybe using > a random order to assign buffer locks in buf_init.c would prevent > simple adjacency of buffer allocation to cause the storm. Good try, but no cigar ;-). The test cases I've been looking at take only shared locks on the per-buffer locks, so that's not where the context swaps are coming from. The swaps have to be caused by the BufMgrLock, because that's the only exclusive lock being taken. I did try increasing the allocated size of the spinlocks to 128 bytes to see if it would do anything. It didn't ... regards, tom lane
Dave Cramer <pg@fastcrypt.com> writes: > diff -c -r1.16 s_lock.c > *** backend/storage/lmgr/s_lock.c 8 Aug 2003 21:42:00 -0000 1.16 > --- backend/storage/lmgr/s_lock.c 21 Apr 2004 20:27:34 -0000 > *************** > *** 76,82 **** > * The select() delays are measured in centiseconds (0.01 sec) because 10 > * msec is a common resolution limit at the OS level. > */ > ! #define SPINS_PER_DELAY 100 > #define NUM_DELAYS 1000 > #define MIN_DELAY_CSEC 1 > #define MAX_DELAY_CSEC 100 > --- 76,82 ---- > * The select() delays are measured in centiseconds (0.01 sec) because 10 > * msec is a common resolution limit at the OS level. > */ > ! #define SPINS_PER_DELAY 10 > #define NUM_DELAYS 1000 > #define MIN_DELAY_CSEC 1 > #define MAX_DELAY_CSEC 100 As far as I can tell, this does reduce the rate of semop's significantly, but it does so by bringing the overall processing rate to a crawl :-(. I see 97% CPU idle time when using this patch. I believe what is happening is that the select() delay in s_lock.c is being hit frequently because the spin loop isn't allowed to run long enough to let the other processor get out of the spinlock. regards, tom lane
Tom, > As far as I can tell, this does reduce the rate of semop's > significantly, but it does so by bringing the overall processing rate > to a crawl :-(. I see 97% CPU idle time when using this patch. > I believe what is happening is that the select() delay in s_lock.c is > being hit frequently because the spin loop isn't allowed to run long > enough to let the other processor get out of the spinlock. Also, I tested it on production data, and it reduces the CSes by about 40%. An improvement, but not a magic bullet. -- Josh Berkus Aglio Database Solutions San Francisco
Dave Cramer <pg@fastcrypt.com> writes: > I tried increasing the NUM_SPINS to 1000 and it works better. Doesn't surprise me. The value of 100 is about right on the assumption that the spinlock instruction per se is not too much more expensive than any other instruction. What I was seeing from oprofile suggested that the spinlock instruction cost about 100x more than an ordinary instruction :-( ... so maybe 200 or so would be good on a Xeon. > This is certainly heading in the right direction ? Although it looks > like it is highly dependent on the system you are running on. Yeah. I don't know a reasonable way to tune this number automatically for particular systems ... but at the very least we'd need to find a way to distinguish uniprocessor from multiprocessor, because on a uniprocessor the optimal value is surely 1. regards, tom lane
> Yeah. I don't know a reasonable way to tune this number automatically > for particular systems ... but at the very least we'd need to find a way > to distinguish uniprocessor from multiprocessor, because on a > uniprocessor the optimal value is surely 1. From TODO: * Add code to detect an SMP machine and handle spinlocks accordingly from distributted.net, http://www1.distributed.net/source, in client/common/cpucheck.cpp Chris
Tom Lane wrote: > Dave Cramer <pg@fastcrypt.com> writes: > > I tried increasing the NUM_SPINS to 1000 and it works better. > > Doesn't surprise me. The value of 100 is about right on the assumption > that the spinlock instruction per se is not too much more expensive than > any other instruction. What I was seeing from oprofile suggested that > the spinlock instruction cost about 100x more than an ordinary > instruction :-( ... so maybe 200 or so would be good on a Xeon. > > > This is certainly heading in the right direction ? Although it looks > > like it is highly dependent on the system you are running on. > > Yeah. I don't know a reasonable way to tune this number automatically > for particular systems ... but at the very least we'd need to find a way > to distinguish uniprocessor from multiprocessor, because on a > uniprocessor the optimal value is surely 1. Have you looked at the code pointed to by our TODO item: * Add code to detect an SMP machine and handle spinlocks accordingly from distributted.net, http://www1.distributed.net/source, in client/common/cpucheck.cpp For BSDOS it has: #if (CLIENT_OS == OS_FREEBSD) || (CLIENT_OS == OS_BSDOS) || \ (CLIENT_OS == OS_OPENBSD) || (CLIENT_OS == OS_NETBSD) { /* comment out if inappropriate for your *bsd - cyp (25/may/1999) */ int ncpus; size_t len = sizeof(ncpus); int mib[2]; mib[0] = CTL_HW; mib[1] = HW_NCPU; if (sysctl( &mib[0], 2, &ncpus, &len, NULL, 0 ) == 0) //if (sysctlbyname("hw.ncpu", &ncpus, &len, NULL, 0 ) == 0) cpucount = ncpus; } and I can confirm that on my computer it works: hw.ncpu = 2 -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
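For comparison, on Linux the same information can be read without sysctl; a minimal sketch using glibc's sysconf() interface (shown only to illustrate the detection step, not as a proposed patch):

    #include <unistd.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* number of processors currently online, as seen by the kernel */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        if (ncpus < 1)
            ncpus = 1;              /* fall back to assuming a uniprocessor */
        printf("detected %ld CPU(s)\n", ncpus);
        return 0;
    }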
Bruce Momjian <pgman@candle.pha.pa.us> writes: > For BSDOS it has: > #if (CLIENT_OS == OS_FREEBSD) || (CLIENT_OS == OS_BSDOS) || \ > (CLIENT_OS == OS_OPENBSD) || (CLIENT_OS == OS_NETBSD) > { /* comment out if inappropriate for your *bsd - cyp (25/may/1999) */ > int ncpus; size_t len = sizeof(ncpus); > int mib[2]; mib[0] = CTL_HW; mib[1] = HW_NCPU; > if (sysctl( &mib[0], 2, &ncpus, &len, NULL, 0 ) == 0) > //if (sysctlbyname("hw.ncpu", &ncpus, &len, NULL, 0 ) == 0) > cpucount = ncpus; > } Multiplied by how many platforms? Ewww... I was wondering about some sort of dynamic adaptation, roughly along the lines of "whenever a spin loop successfully gets the lock after spinning, decrease the allowed loop count by one; whenever we fail to get the lock after spinning, increase by 100; if the loop count reaches, say, 10000, decide we are on a uniprocessor and irreversibly set it to 1." As written this would tend to incur a select() delay once per hundred spinlock acquisitions, which is way too much, but I think we could make it work with a sufficiently slow adaptation rate. The tricky part is that a slow adaptation rate means we can't have every backend figuring this out for itself --- the right value would have to be maintained globally, and I'm not sure how to do that without adding a lot of overhead. regards, tom lane
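To make the adaptation rule above concrete, here is a rough sketch of what each contended acquisition might do; the variable name is invented, uncontended (first-try) acquisitions would not call it, and as Tom notes, keeping this count shared across backends cheaply is the unsolved part:

    static int spins_allowed = 100;     /* per-process here; would need to be global */

    static void
    adapt_spins(int acquired_while_spinning)
    {
        if (spins_allowed == 1)
            return;                     /* already decided: uniprocessor */

        if (acquired_while_spinning)
        {
            if (spins_allowed > 1)
                spins_allowed--;        /* spinning paid off: shrink slowly */
        }
        else
        {
            spins_allowed += 100;       /* had to select(): spin longer next time */
            if (spins_allowed >= 10000)
                spins_allowed = 1;      /* give up: treat as uniprocessor */
        }
    }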
Paul Tuckfield <paul@tuckfield.com> writes: >> I used the taskset command: >> taskset 01 -p <pid for backend of test_run.sql 1> >> taskset 01 -p <pid for backend of test_run.sql 1> >> >> I guess that 0 and 1 are the two cores (pipelines? hyper-threads?) on >> the first Xeon processor in the box. AFAICT, what you've actually done here is to bind both backends to the first logical processor of the first Xeon. If you'd used 01 and 02 as the affinity masks then you'd have bound them to the two cores of that Xeon, but what you actually did simply reduces the system to a uniprocessor. In that situation the context swap rate will be normally one swap per scheduler timeslice, and at worst two swaps per timeslice (if a process is swapped away from while it holds a lock the other one wants). It doesn't prove a lot about our SMP problem though. I don't have access to a Xeon with both taskset and hyperthreading enabled, so I can't check what happens when you do the taskset correctly ... could you retry? regards, tom lane
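For reference, the same binding can be expressed directly through the Linux affinity syscall instead of taskset; this sketch (assuming a glibc that exposes sched_setaffinity and the CPU_SET macros) binds a given pid to one logical CPU, i.e. mask 0x01 for CPU 0 or 0x02 for CPU 1:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        pid_t     pid;
        int       cpu;
        cpu_set_t mask;

        if (argc != 3)
        {
            fprintf(stderr, "usage: %s <pid> <cpu>\n", argv[0]);
            return 1;
        }
        pid = (pid_t) atoi(argv[1]);
        cpu = atoi(argv[2]);

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);            /* one bit: cpu 0 => mask 0x01, cpu 1 => 0x02 */

        if (sched_setaffinity(pid, sizeof(mask), &mask) != 0)
        {
            perror("sched_setaffinity");
            return 1;
        }
        return 0;
    }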
Yeah, I did some more testing myself, and actually get better numbers with increasing spins per delay to 1000, but my suspicion is that it is highly dependent on finding the right delay for the processor you are on. My hypothesis is that if you spin approximately the same or more time than the average time it takes to get finished with the shared resource then this should reduce cs. Certainly more ideas are required here. Dave On Wed, 2004-04-21 at 22:35, Tom Lane wrote: > Dave Cramer <pg@fastcrypt.com> writes: > > diff -c -r1.16 s_lock.c > > *** backend/storage/lmgr/s_lock.c 8 Aug 2003 21:42:00 -0000 1.16 > > --- backend/storage/lmgr/s_lock.c 21 Apr 2004 20:27:34 -0000 > > *************** > > *** 76,82 **** > > * The select() delays are measured in centiseconds (0.01 sec) because 10 > > * msec is a common resolution limit at the OS level. > > */ > > ! #define SPINS_PER_DELAY 100 > > #define NUM_DELAYS 1000 > > #define MIN_DELAY_CSEC 1 > > #define MAX_DELAY_CSEC 100 > > --- 76,82 ---- > > * The select() delays are measured in centiseconds (0.01 sec) because 10 > > * msec is a common resolution limit at the OS level. > > */ > > ! #define SPINS_PER_DELAY 10 > > #define NUM_DELAYS 1000 > > #define MIN_DELAY_CSEC 1 > > #define MAX_DELAY_CSEC 100 > > > As far as I can tell, this does reduce the rate of semop's > significantly, but it does so by bringing the overall processing rate > to a crawl :-(. I see 97% CPU idle time when using this patch. > I believe what is happening is that the select() delay in s_lock.c is > being hit frequently because the spin loop isn't allowed to run long > enough to let the other processor get out of the spinlock. > > regards, tom lane > > > > !DSPAM:40872f7e21492906114513! > > -- Dave Cramer 519 939 0336 ICQ # 14675561
More data.... On a dual xeon with HTT enabled: I tried increasing the NUM_SPINS to 1000 and it works better. NUM_SPINLOCKS CS ID pgbench 100 250K 59% 230 TPS 1000 125K 55% 228 TPS This is certainly heading in the right direction ? Although it looks like it is highly dependent on the system you are running on. --dc-- On Wed, 2004-04-21 at 22:53, Josh Berkus wrote: > Tom, > > > As far as I can tell, this does reduce the rate of semop's > > significantly, but it does so by bringing the overall processing rate > > to a crawl :-(. I see 97% CPU idle time when using this patch. > > I believe what is happening is that the select() delay in s_lock.c is > > being hit frequently because the spin loop isn't allowed to run long > > enough to let the other processor get out of the spinlock. > > Also, I tested it on production data, and it reduces the CSes by about 40%. > An improvement, but not a magic bullet. -- Dave Cramer 519 939 0336 ICQ # 14675561
Dave Cramer <pg@fastcrypt.com> writes: > My hypothesis is that if you spin approximately the same or more time > than the average time it takes to get finished with the shared resource > then this should reduce cs. The only thing we use spinlocks for nowadays is to protect LWLocks, so the "average time" involved is fairly small and stable --- or at least that was the design intention. What we seem to be seeing is that on SMP machines, cache coherency issues cause the TAS step itself to be expensive and variable. However, in the experiments I did, strace'ing showed that actual spin timeouts (manifested by the execution of a delaying select()) weren't actually that common; the big source of context switches is semop(), which indicates contention at the LWLock level rather than the spinlock level. So while tuning the spinlock limit count might be a useful thing to do in general, I think it will have only negligible impact on the particular problems we're discussing in this thread. regards, tom lane
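To see why blocking at the LWLock level surfaces as semop() in strace, here is a stripped-down sketch of a SysV-semaphore sleep/wake primitive; it is only an illustration of where the syscall (and hence the context switch) comes from, not the PostgreSQL code:

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* glibc requires the caller to define this (see semctl(2)) */
    union semun
    {
        int              val;
        struct semid_ds *buf;
        unsigned short  *array;
    };

    static int
    sema_create(void)
    {
        int         semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        union semun arg;

        arg.val = 0;                    /* start "unavailable": first wait blocks */
        semctl(semid, 0, SETVAL, arg);
        return semid;
    }

    static void
    sema_wait(int semid)                /* blocks in the kernel: the semop() in strace */
    {
        struct sembuf op = { 0, -1, 0 };

        semop(semid, &op, 1);
    }

    static void
    sema_signal(int semid)              /* wakes one sleeping waiter */
    {
        struct sembuf op = { 0, 1, 0 };

        semop(semid, &op, 1);
    }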
Tom, > The tricky > part is that a slow adaptation rate means we can't have every backend > figuring this out for itself --- the right value would have to be > maintained globally, and I'm not sure how to do that without adding a > lot of overhead. This may be a moot point, since you've stated that changing the loop timing won't solve the problem, but what about making the test part of make? I don't think too many systems are going to change processor architectures once in production, and those that do can be told to re-compile. -- Josh Berkus Aglio Database Solutions San Francisco
Tom, > Having to recompile to run on single- vs dual-processor machines doesn't > seem like it would fly. Oh, I don't know. Many applications require compiling for a target architecture; SQL Server, for example, won't use a 2nd processor without re-installation. I'm not sure about Oracle. It certainly wasn't too long ago that Linux gurus were esposing re-compiling the kernel for the machine. And it's not like they would *have* to re-compile to use PostgreSQL after adding an additional processor. Just if they wanted to maximize peformance benefit. Also, this is a fairly rare circumstance, I think; to judge by my clients, once a database server is in production nobody touches the hardware. -- -Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > This may be a moot point, since you've stated that changing the loop timing > won't solve the problem, but what about making the test part of make? I > don't think too many systems are going to change processor architectures once > in production, and those that do can be told to re-compile. Having to recompile to run on single- vs dual-processor machines doesn't seem like it would fly. regards, tom lane
On Thu, 2004-04-22 at 13:55, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > This may be a moot point, since you've stated that changing the loop timing > > won't solve the problem, but what about making the test part of make? I > > don't think too many systems are going to change processor architectures once > > in production, and those that do can be told to re-compile. > > Having to recompile to run on single- vs dual-processor machines doesn't > seem like it would fly. Is it something the postmaster could quickly determine and set a global during the startup cycle?
Tested the sql on Quad 2.0GHz XEON/8GB RAM: During the first run, the CS shot up to more than 100k, and was randomly high/low. A second process made it consistently high, 100k+. A third brought it down to an average of 80-90k. A fourth brought it down to an average of 50-60k/s. By cancelling the queries one-by-one, the CS started going up again. 8 logical CPUs in 'top', all of them not at all too busy, load average stood around 2 all the time. Thanks. Anjan -----Original Message----- From: Josh Berkus [mailto:josh@agliodbs.com] Sent: Tue 4/20/2004 12:59 PM To: Anjan Dave; Dirk Lutzebäck; Tom Lane Cc: pgsql-performance@postgreSQL.org; Neil Conway Subject: Re: [PERFORM] Wierd context-switching issue on Xeon Anjan, > Quad 2.0GHz XEON with highest load we have seen on the applications, DB > performing great - Can you run Tom's test? It takes a particular pattern of data access to reproduce the issue. -- Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
On Thu, 2004-04-22 at 10:37 -0700, Josh Berkus wrote: > Tom, > > > The tricky > > part is that a slow adaptation rate means we can't have every backend > > figuring this out for itself --- the right value would have to be > > maintained globally, and I'm not sure how to do that without adding a > > lot of overhead. > > This may be a moot point, since you've stated that changing the loop timing > won't solve the problem, but what about making the test part of make? I > don't think too many systems are going to change processor architectures once > in production, and those that do can be told to re-compile. Sure they do - PostgreSQL is regularly provided as a pre-compiled distribution. I haven't compiled PostgreSQL for years, and we have it running on dozens of machines, some SMP, some not, but most running Debian Linux. Even having a compiler _installed_ on one of our client's database servers would usually be considered against security procedures, and would get a black mark when the auditors came through. Regards, Andrew McMillan ------------------------------------------------------------------------- Andrew @ Catalyst .Net .NZ Ltd, PO Box 11-053, Manners St, Wellington WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267 Planning an election? Call us! -------------------------------------------------------------------------
Tom Lane wrote: > > Hmmm ... I've been able to reproduce the CS storm on a dual Athlon, > which seems to pretty much let the Xeon per se off the hook. Anybody > got a multiple Opteron to try? Totally non-Intel CPUs? > > It would be interesting to see results with non-Linux kernels, too. > > regards, tom lane I also tested on an dual Athlon MP Tyan Thunder motherboard (2xMP2800+, 2.5GB memory), and got the same high numbers. I then ran with kernel 2.6.5, it lowered them a little, but it's still some ping pong effect here. I wonder if this is some effect of the scheduler, maybe the shed frequency alone (100HZ vs 1000HZ). It would be interesting to see what a locking implementation ala FUTEX style would give on an 2.6 kernel, as i understood it that would work cross process with some work. The first file attached is kernel 2.4 running one process then starting up the other one. Same with second file, but with kernel 2.6... Regards Magnus procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 1828408 27852 528852 0 0 0 0 317 557 50 0 50 0 1 0 0 1828408 27852 528852 0 0 0 0 293 491 50 0 49 0 1 0 0 1828400 27860 528852 0 0 0 16 399 709 50 0 50 0 1 0 0 1828400 27860 528852 0 0 0 0 350 593 50 0 49 0 2 0 0 1828400 27860 528852 0 0 0 0 349 608 50 0 50 0 1 0 0 1828400 27860 528852 0 0 0 0 109 412 50 0 50 0 1 0 0 1828400 27860 528852 0 0 0 0 101 92 50 0 50 0 1 0 0 1828392 27868 528852 0 0 0 16 104 96 50 0 50 0 1 0 0 1828392 27868 528852 0 0 0 0 101 103 50 0 50 0 2 0 0 1827408 27892 528852 0 0 8 48 113 61197 45 9 46 0 2 0 0 1827408 27892 528852 0 0 0 0 101 167237 41 27 32 0 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 4 0 0 1827408 27892 528852 0 0 0 0 101 166145 39 25 36 0 2 0 0 1827400 27900 528852 0 0 0 48 105 149406 42 19 40 0 3 0 0 1827400 27900 528852 0 0 0 0 101 157559 43 26 32 0 2 0 0 1827400 27900 528852 0 0 0 0 101 163813 46 24 30 0 2 0 0 1827400 27900 528852 0 0 0 0 101 156872 44 26 30 0 2 0 0 1827400 27900 528852 0 0 0 0 103 160722 45 28 28 0 2 0 0 1827392 27908 528852 0 0 0 16 104 158644 41 23 37 0 3 0 0 1827392 27908 528852 0 0 0 0 101 157534 42 25 33 0 2 0 0 1827392 27908 528852 0 0 0 0 101 160007 37 28 35 0 3 0 0 1827392 27908 528852 0 0 0 0 101 161852 45 24 31 0 3 0 0 1827392 27908 528852 0 0 0 0 101 161616 42 25 33 0 2 0 0 1827392 27916 528852 0 0 0 68 114 152144 44 25 31 0 2 0 0 1827384 27916 528852 0 0 0 0 101 156485 35 28 37 0 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 2436044 8844 90028 0 0 0 16 1010 235 50 0 50 0 1 0 0 2436108 8844 90028 0 0 0 0 1024 404 50 0 50 0 1 0 0 2436108 8844 90028 0 0 0 0 1008 199 50 0 50 0 1 0 0 2436108 8844 90028 0 0 0 0 1017 272 50 0 50 0 1 0 0 2436108 8844 90028 0 0 0 0 1013 253 50 0 50 0 1 1 0 2436108 8852 90020 0 0 0 16 1019 282 51 0 49 1 2 0 0 2435068 8852 90020 0 0 0 0 1005 23929 45 4 50 0 2 0 0 2435068 8852 90020 0 0 0 20 1008 95501 33 14 53 0 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 3 0 0 2435068 8852 90020 0 0 0 0 1002 103940 35 15 50 0 0 0 0 2435068 8852 90020 0 0 0 0 1003 104343 32 16 51 0 2 0 0 2435068 8860 90080 0 0 0 52 1006 102477 34 16 51 1 2 0 0 2435068 8860 90080 0 0 0 0 1002 92809 31 14 54 0 2 0 0 2435068 8860 90080 0 0 0 0 1002 100498 37 14 49 0 1 0 0 2435068 8860 90080 0 0 0 
0 1002 108130 35 16 49 0 0 0 0 2435068 8860 90080 0 0 0 0 1002 94045 33 14 54 0 0 0 0 2435004 8868 90072 0 0 0 16 1005 104380 34 15 52 0 2 0 0 2435004 8868 90072 0 0 0 0 1002 100696 36 14 50 0 2 0 0 2435068 8868 90072 0 0 0 0 1002 98289 31 14 54 0 0 0 0 2435068 8868 90072 0 0 0 0 1002 97287 31 14 55 0 0 0 0 2435068 8868 90072 0 0 0 0 1002 92787 34 14 53 0 0 0 0 2435068 8876 90064 0 0 0 16 1005 98568 32 16 52 1 2 0 0 2435068 8876 90064 0 0 0 0 1003 107104 37 16 47 0
On Wed, Apr 21, 2004 at 02:51:31PM -0400, Tom Lane wrote: > The context swap storm is happening because of contention at the next > level up (LWLocks rather than spinlocks). It could be an independent > issue that just happens to be triggered by the same sort of access > pattern. I put forward a hypothesis that the cache miss storm caused by > the test-and-set ops induces the context swap storm by making the code > more likely to be executing in certain places at certain times ... but > it's only a hypothesis. > If the context swap storm derives from LWLock contention, maybe using a random order to assign buffer locks in buf_init.c would prevent simple adjacency of buffer allocation to cause the storm. Just offsetting the assignment by the cacheline size should work. I notice that when initializing the buffers in shared memory, both the buf->meta_data_lock and the buf->cntx_lock are immediately adjacent in memory. I am not familiar enough with the flow through postgres to see if there could be "fighting" for those two locks. If so, offsetting those by the cache line size would also stop the context swap storm. --Ken
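A minimal sketch of the offset-by-a-cache-line idea, assuming the 128-byte figure Tom mentions experimenting with earlier in the thread; the struct and names are invented for illustration:

    #define CACHE_LINE_SIZE 128         /* the size Tom reports trying */

    typedef struct
    {
        volatile int lock;              /* the test-and-set word */
        char         pad[CACHE_LINE_SIZE - sizeof(int)];   /* keep neighbours off this line */
    } PaddedLock;

    /* adjacent locks now sit on distinct cache lines, so two backends
     * hammering locks i and i+1 no longer invalidate each other's line */
    static PaddedLock buffer_locks[1024];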
Magus, > It would be interesting to see what a locking implementation ala FUTEX > style would give on an 2.6 kernel, as i understood it that would work > cross process with some work. I'mm working on testing a FUTEX patch, but am having some trouble with it. Will let you know the results .... -- -Josh Berkus Aglio Database Solutions San Francisco
Dave, > Yeah, I did some more testing myself, and actually get better numbers > with increasing spins per delay to 1000, but my suspicion is that it is > highly dependent on finding the right delay for the processor you are > on. Well, it certainly didn't help here: procs memory swap io system cpu r b swpd free buff cache si so bi bo in cs us sy id wa 2 0 0 14870744 123872 1129912 0 0 0 0 1027 187341 48 27 26 0 2 0 0 14869912 123872 1129912 0 0 0 48 1030 126490 65 18 16 0 2 0 0 14867032 123872 1129912 0 0 0 0 1021 106046 72 16 12 0 2 0 0 14869912 123872 1129912 0 0 0 0 1025 90256 76 14 10 0 2 0 0 14870424 123872 1129912 0 0 0 0 1022 135249 63 22 16 0 2 0 0 14872664 123872 1129912 0 0 0 0 1023 131111 63 20 17 0 1 0 0 14871128 123872 1129912 0 0 0 48 1024 155728 57 22 20 0 2 0 0 14871128 123872 1129912 0 0 0 0 1028 189655 49 29 22 0 2 0 0 14871064 123872 1129912 0 0 0 0 1018 190744 48 29 23 0 2 0 0 14871064 123872 1129912 0 0 0 0 1027 186812 51 26 23 0 -- -Josh Berkus Aglio Database Solutions San Francisco
Are you testing this with Tom's code, you need to do a baseline measurement with 10 and then increase it, you will still get lots of cs, but it will be less. Dave On Mon, 2004-04-26 at 20:03, Josh Berkus wrote: > Dave, > > > Yeah, I did some more testing myself, and actually get better numbers > > with increasing spins per delay to 1000, but my suspicion is that it is > > highly dependent on finding the right delay for the processor you are > > on. > > Well, it certainly didn't help here: > > procs memory swap io system cpu > r b swpd free buff cache si so bi bo in cs us sy id wa > 2 0 0 14870744 123872 1129912 0 0 0 0 1027 187341 48 27 > 26 0 > 2 0 0 14869912 123872 1129912 0 0 0 48 1030 126490 65 18 > 16 0 > 2 0 0 14867032 123872 1129912 0 0 0 0 1021 106046 72 16 > 12 0 > 2 0 0 14869912 123872 1129912 0 0 0 0 1025 90256 76 14 10 > 0 > 2 0 0 14870424 123872 1129912 0 0 0 0 1022 135249 63 22 > 16 0 > 2 0 0 14872664 123872 1129912 0 0 0 0 1023 131111 63 20 > 17 0 > 1 0 0 14871128 123872 1129912 0 0 0 48 1024 155728 57 22 > 20 0 > 2 0 0 14871128 123872 1129912 0 0 0 0 1028 189655 49 29 > 22 0 > 2 0 0 14871064 123872 1129912 0 0 0 0 1018 190744 48 29 > 23 0 > 2 0 0 14871064 123872 1129912 0 0 0 0 1027 186812 51 26 > 23 0 -- Dave Cramer 519 939 0336 ICQ # 14675561
Dave, > Are you testing this with Tom's code, you need to do a baseline > measurement with 10 and then increase it, you will still get lots of cs, > but it will be less. No, that was just a test of 1000 straight up. Tom outlined a method, but I didn't see any code that would help me find a better level, other than just trying each +100 increase one at a time. This would take days of testing ... -- Josh Berkus Aglio Database Solutions San Francisco
Josh, I think you can safely increase by orders of magnitude here, instead of by +100, my wild ass guess is that the sweet spot is the spin time should be approximately the time it takes to consume the resource. So if you have a really fast machine then the spin count should be higher. Also you have to take into consideration your memory bus speed, with the pause instruction inserted in the loop the timing is now dependent on memory speed. But... you need a baseline first. Dave On Tue, 2004-04-27 at 14:05, Josh Berkus wrote: > Dave, > > > Are you testing this with Tom's code, you need to do a baseline > > measurement with 10 and then increase it, you will still get lots of cs, > > but it will be less. > > No, that was just a test of 1000 straight up. Tom outlined a method, but I > didn't see any code that would help me find a better level, other than just > trying each +100 increase one at a time. This would take days of testing > ... -- Dave Cramer 519 939 0336 ICQ # 14675561
Dave, > But... you need a baseline first. A baseline on CS? I have that .... -- -Josh Berkus Aglio Database Solutions San Francisco
When grilled further on (Wed, 21 Apr 2004 10:29:43 -0700), Josh Berkus <josh@agliodbs.com> confessed: > Dave, > > > After some testing if you use the current head code for s_lock.c which > > has some mods in it to alleviate this situation, and change > > SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test. > > I am seeing a slight degradation in throughput using pgbench -c 10 -t > > 1000 but it might be liveable, considering the alternative is unbearable > > in some situations. > > > > Can anyone else replicate my results? > > Can you produce a patch against 7.4.1? I'd like to test your fix against a > real-world database. I would like to see the same, as I have a system that exhibits the same behavior on a production db that's running 7.4.1. Cheers, Rob -- 18:55:22 up 1:40, 4 users, load average: 2.00, 2.04, 2.00 Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004
Hi I'd LOVE to contribute on this but I don't have vmstat and I'm not running linux. How can I help? Regards On Wed, 28 Apr 2004, Robert Creager wrote: > Date: Wed, 28 Apr 2004 18:57:53 -0600 > From: Robert Creager <Robert_Creager@LogicalChaos.org> > To: Josh Berkus <josh@agliodbs.com> > Cc: pg@fastcrypt.com, Dirk_Lutzebäck <lutzeb@aeccom.com>, ohp@pyrenet.fr, > Tom Lane <tgl@sss.pgh.pa.us>, Joe Conway <mail@joeconway.com>, > scott.marlowe <scott.marlowe@ihs.com>, > Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-performance@postgresql.org, > Neil Conway <neilc@samurai.com> > Subject: Re: [PERFORM] Wierd context-switching issue on Xeon > > When grilled further on (Wed, 21 Apr 2004 10:29:43 -0700), > Josh Berkus <josh@agliodbs.com> confessed: > > > Dave, > > > > > After some testing if you use the current head code for s_lock.c which > > > has some mods in it to alleviate this situation, and change > > > SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test. > > > I am seeing a slight degradation in throughput using pgbench -c 10 -t > > > 1000 but it might be liveable, considering the alternative is unbearable > > > in some situations. > > > > > > Can anyone else replicate my results? > > > > Can you produce a patch against 7.4.1? I'd like to test your fix against a > > real-world database. > > I would like to see the same, as I have a system that exhibits the same behavior > on a production db that's running 7.4.1. > > Cheers, > Rob > > > -- Olivier PRENANT Tel: +33-5-61-50-97-00 (Work) 6, Chemin d'Harraud Turrou +33-5-61-50-97-01 (Fax) 31190 AUTERIVE +33-6-07-63-80-64 (GSM) FRANCE Email: ohp@pyrenet.fr ------------------------------------------------------------------------------ Make your life a dream, make your dream a reality. (St Exupery)
Rob, > I would like to see the same, as I have a system that exhibits the same behavior > on a production db that's running 7.4.1. If you checked the thread follow-ups, you'd see that *decreasing* spins_per_delay was not beneficial. Instead, try increasing them, one step at a time: (take baseline measurement at 100) 250 500 1000 1500 2000 3000 5000 ... until you find an optimal level. Then report the results to us! -- -Josh Berkus Aglio Database Solutions San Francisco
When grilled further on (Thu, 29 Apr 2004 11:21:51 -0700), Josh Berkus <josh@agliodbs.com> confessed: > spins_per_delay was not beneficial. Instead, try increasing them, one step > at a time: > > (take baseline measurement at 100) > 250 > 500 > 1000 > 1500 > 2000 > 3000 > 5000 > > ... until you find an optimal level. Then report the results to us! > Some results. The patch mentioned is what Dave Cramer posted to the Performance list on 4/21. A Perl script monitored <vmstat 1> for 120 seconds and generated max and average values. Unfortunately, I am not present on site, so I cannot physically change the device under test to increase the db load to where it hit about 10 days ago. That will have to wait till the 'real' work week on Monday. Context switches - avg max Default 7.4.1 code : 10665 69470 Default patch - 10 : 17297 21929 patch at 100 : 26825 87073 patch at 1000 : 37580 110849 Now granted, the db isn't showing the CS swap problem in a bad way (at all), but should the numbers be trending the way they are with the patched code? Or will these numbers potentially change dramatically when I can load up the db? And, presuming I can reproduce what I was seeing previously (200K CS/s), do you folks want me to carry on with more testing of the patch and report the results? Or just go away and be quiet... The information is provided from an HP Proliant DL380 G3 with 2x 2.4 GHz Xeons (with HT enabled), 2 GB RAM, running the 2.4.22-26mdkenterprise kernel, RAID controller w/128 MB battery-backed cache, RAID 1 on 2x 15K RPM drives for the WAL drive, RAID 0+1 on 4x 10K RPM drives for data. The only job this box has is running this db. Cheers, Rob -- 21:54:48 up 2 days, 4:39, 4 users, load average: 2.00, 2.03, 2.00 Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004
No, don't go away and be quiet. Keep testing, it may be that under normal operation the context switching goes up but under the conditions that you were seeing the high CS it may not be as bad. As others have mentioned the real solution to this is to rewrite the buffer management so that the lock isn't quite as coarse grained. Dave On Sat, 2004-05-01 at 00:03, Robert Creager wrote: > When grilled further on (Thu, 29 Apr 2004 11:21:51 -0700), > Josh Berkus <josh@agliodbs.com> confessed: > > > spins_per_delay was not beneficial. Instead, try increasing them, one step > > at a time: > > > > (take baseline measurement at 100) > > 250 > > 500 > > 1000 > > 1500 > > 2000 > > 3000 > > 5000 > > > > ... until you find an optimal level. Then report the results to us! > > > > Some results. The patch mentioned is what Dave Cramer posted to the Performance > list on 4/21. > > A Perl script monitored <vmstat 1> for 120 seconds and generated max and average > values. Unfortunately, I am not present on site, so I cannot physically change > the device under test to increase the db load to where it hit about 10 days ago. > That will have to wait till the 'real' work week on Monday. > > Context switches - avg max > > Default 7.4.1 code : 10665 69470 > Default patch - 10 : 17297 21929 > patch at 100 : 26825 87073 > patch at 1000 : 37580 110849 > > Now granted, the db isn't showing the CS swap problem in a bad way (at all), but > should the numbers be trending the way they are with the patched code? Or will > these numbers potentially change dramatically when I can load up the db? > > And, presuming I can re-produce what I was seeing previously (200K CS/s), you > folks want me to carry on with more testing of the patch and report the results? > Or just go away and be quiet... > > The information is provided from a HP Proliant DL380 G3 with 2x 2.4 GHZ Xenon's > (with HT enabled) 2 GB ram, running 2.4.22-26mdkenterprise kernel, RAID > controller w/128 Mb battery backed cache RAID 1 on 2x 15K RPM drives for WAL > drive, RAID 0+1 on 4x 10K RPM drives for data. The only job this box has is > running this db. > > Cheers, > Rob -- Dave Cramer 519 939 0336 ICQ # 14675561
Found some co-workers at work yesterday to load up my library... The sample period is 5 minutes long (vs 2 minutes previously): Context switches - avg max Default 7.4.1 code : 48784 107354 Default patch - 10 : 20400 28160 patch at 100 : 38574 85372 patch at 1000 : 41188 106569 The reading at 1000 was not produced under the same circumstances as the prior readings as I had to replace my device under test with a simulated one. The real one died. The previous run with smaller database and 120 second averages: Context switches - avg max Default 7.4.1 code : 10665 69470 Default patch - 10 : 17297 21929 patch at 100 : 26825 87073 patch at 1000 : 37580 110849 -- 20:13:50 up 3 days, 2:58, 4 users, load average: 2.12, 2.14, 2.10 Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004
Robert, The real question is does it help under real life circumstances ? Did you do the tests with Tom's sql code that is designed to create high context switchs ? Dave On Sun, 2004-05-02 at 11:20, Robert Creager wrote: > Found some co-workers at work yesterday to load up my library... > > The sample period is 5 minutes long (vs 2 minutes previously): > > Context switches - avg max > > Default 7.4.1 code : 48784 107354 > Default patch - 10 : 20400 28160 > patch at 100 : 38574 85372 > patch at 1000 : 41188 106569 > > The reading at 1000 was not produced under the same circumstances as the prior > readings as I had to replace my device under test with a simulated one. The > real one died. > > The previous run with smaller database and 120 second averages: > > Context switches - avg max > > Default 7.4.1 code : 10665 69470 > Default patch - 10 : 17297 21929 > patch at 100 : 26825 87073 > patch at 1000 : 37580 110849 -- Dave Cramer 519 939 0336 ICQ # 14675561
When grilled further on (Sun, 02 May 2004 11:39:22 -0400), Dave Cramer <pg@fastcrypt.com> confessed: > Robert, > > The real question is does it help under real life circumstances ? I'm not yet at the point where the CS's are causing appreciable delays. I should get there early this week and will be able to measure the relief your patch may provide. > > Did you do the tests with Tom's sql code that is designed to create high > context switchs ? No, I'm using my queries/data. Cheers, Rob -- 10:44:58 up 3 days, 17:30, 4 users, load average: 2.00, 2.04, 2.01 Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004
Did we ever come to a conclusion about excessive SMP context switching under load? --------------------------------------------------------------------------- Dave Cramer wrote: > Robert, > > The real question is does it help under real life circumstances ? > > Did you do the tests with Tom's sql code that is designed to create high > context switchs ? > > Dave > On Sun, 2004-05-02 at 11:20, Robert Creager wrote: > > Found some co-workers at work yesterday to load up my library... > > > > The sample period is 5 minutes long (vs 2 minutes previously): > > > > Context switches - avg max > > > > Default 7.4.1 code : 48784 107354 > > Default patch - 10 : 20400 28160 > > patch at 100 : 38574 85372 > > patch at 1000 : 41188 106569 > > > > The reading at 1000 was not produced under the same circumstances as the prior > > readings as I had to replace my device under test with a simulated one. The > > real one died. > > > > The previous run with smaller database and 120 second averages: > > > > Context switches - avg max > > > > Default 7.4.1 code : 10665 69470 > > Default patch - 10 : 17297 21929 > > patch at 100 : 26825 87073 > > patch at 1000 : 37580 110849 > -- > Dave Cramer > 519 939 0336 > ICQ # 14675561 > > > ---------------------------(end of broadcast)--------------------------- > TIP 7: don't forget to increase your free space map settings > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
When grilled further on (Wed, 19 May 2004 21:20:20 -0400 (EDT)), Bruce Momjian <pgman@candle.pha.pa.us> confessed: > > Did we ever come to a conclusion about excessive SMP context switching > under load? > I just figured out what was causing the problem on my system Monday. I'm using the pg_autovacuum daemon, and it was not vacuuming my db. I've no idea why and didn't get a chance to investigate. This lack of vacuuming was causing a huge number of context switches and query delays. the queries that normally take .1 seconds were taking 11 seconds, and the context switches were averaging 160k/s, peaking at 190k/s Unfortunately, I was under pressure to fix the db at the time so I didn't get a chance to play with the patch. I restarted the vacuum daemon, and will keep an eye on it to see if it behaves. If the problem re-occurs, is it worth while to attempt the different patch delay settings? Cheers, Rob -- 19:45:40 up 21 days, 2:30, 4 users, load average: 2.03, 2.09, 2.06 Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Did we ever come to a conclusion about excessive SMP context switching > under load? Yeah: it's bad. Oh, you wanted a fix? That seems harder :-(. AFAICS we need a redesign that causes less load on the BufMgrLock. However, the traditional solution to too-much-contention-for-a-lock is to break up the locked data structure into finer-grained units, which means *more* lock operations in total. Normally you expect that the finer-grained lock units will mean less contention. But given that the issue here seems to be trading physical ownership of the lock's cache line back and forth, I'm afraid that the traditional approach would actually make things worse. The SMP issue seems to be not with whether there is instantaneous contention for the locked datastructure, but with the cost of making it possible for processor B to acquire a lock recently held by processor A. regards, tom lane
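For reference, the traditional finer-grained approach Tom is doubting here would look roughly like the sketch below, which routes each buffer to one of several locks by hashing; the names and partition count are invented, and as noted above, lower contention per lock does not necessarily mean fewer cache-line transfers overall:

    #define NUM_LOCK_PARTITIONS 16

    typedef struct
    {
        volatile int lock;
        char         pad[124];          /* pad each lock out to a full cache line */
    } PartitionLock;

    static PartitionLock partition_locks[NUM_LOCK_PARTITIONS];

    /* route each buffer to one of several locks instead of a single global
     * BufMgrLock; contention per lock drops, but total lock traffic, and
     * hence the number of cache lines bounced between CPUs, goes up */
    static PartitionLock *
    partition_for(unsigned int buffer_hash)
    {
        return &partition_locks[buffer_hash % NUM_LOCK_PARTITIONS];
    }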
Robert Creager <Robert_Creager@LogicalChaos.org> writes: > I just figured out what was causing the problem on my system Monday. > I'm using the pg_autovacuum daemon, and it was not vacuuming my db. Do you have the post-7.4.2 datatype fixes for pg_autovacuum? regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Did we ever come to a conclusion about excessive SMP context switching > > under load? > > Yeah: it's bad. > > Oh, you wanted a fix? That seems harder :-(. AFAICS we need a redesign > that causes less load on the BufMgrLock. However, the traditional > solution to too-much-contention-for-a-lock is to break up the locked > data structure into finer-grained units, which means *more* lock > operations in total. Normally you expect that the finer-grained lock > units will mean less contention. But given that the issue here seems to > be trading physical ownership of the lock's cache line back and forth, > I'm afraid that the traditional approach would actually make things > worse. The SMP issue seems to be not with whether there is > instantaneous contention for the locked datastructure, but with the cost > of making it possible for processor B to acquire a lock recently held by > processor A. I see. I don't even see a TODO in there. :-( -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
When grilled further on (Wed, 19 May 2004 22:42:26 -0400), Tom Lane <tgl@sss.pgh.pa.us> confessed: > Robert Creager <Robert_Creager@LogicalChaos.org> writes: > > I just figured out what was causing the problem on my system Monday. > > I'm using the pg_autovacuum daemon, and it was not vacuuming my db. > > Do you have the post-7.4.2 datatype fixes for pg_autovacuum? No. I'm still running 7.4.1 w/associated contrib. I guess an upgrade is in order then. I'm currently downloading 7.4.2 to see what the change is that I need. Is it just the 7.4.2 pg_autovacuum that is needed here? I've caught a whiff that 7.4.3 is nearing release? Any idea when? Thanks, Rob -- 20:45:52 up 21 days, 3:30, 4 users, load average: 2.02, 2.05, 2.05 Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> ... The SMP issue seems to be not with whether there is >> instantaneous contention for the locked datastructure, but with the cost >> of making it possible for processor B to acquire a lock recently held by >> processor A. > I see. I don't even see a TODO in there. :-( Nothing more specific than "investigate SMP context switching issues", anyway. We are definitely in a research mode here, rather than an engineering mode. ObQuote: "Research is what I am doing when I don't know what I am doing." - attributed to Werner von Braun, but has anyone got a definitive reference? regards, tom lane
Robert Creager <Robert_Creager@LogicalChaos.org> writes: > Tom Lane <tgl@sss.pgh.pa.us> confessed: >> Do you have the post-7.4.2 datatype fixes for pg_autovacuum? > No. I'm still running 7.4.1 w/associated contrib. I guess an upgrade is in > order then. I'm currently downloading 7.4.2 to see what the change is that I > need. Is it just the 7.4.2 pg_autovacuum that is needed here? Nope, the fixes I was thinking about just missed the 7.4.2 release. I think you can only get them from CVS. (Maybe we should offer a nightly build of the latest stable release branch, not only development tip...) > I've caught a whiff that 7.4.3 is nearing release? Any idea when? Not scheduled yet, but there was talk of pushing one out before 7.5 goes into feature freeze. regards, tom lane
OK, added to TODO: * Investigate SMP context switching issues --------------------------------------------------------------------------- Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> ... The SMP issue seems to be not with whether there is > >> instantaneous contention for the locked datastructure, but with the cost > >> of making it possible for processor B to acquire a lock recently held by > >> processor A. > > > I see. I don't even see a TODO in there. :-( > > Nothing more specific than "investigate SMP context switching issues", > anyway. We are definitely in a research mode here, rather than an > engineering mode. > > ObQuote: "Research is what I am doing when I don't know what I am > doing." - attributed to Werner von Braun, but has anyone got a > definitive reference? > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Robert Creager <Robert_Creager@LogicalChaos.org> writes: > > Tom Lane <tgl@sss.pgh.pa.us> confessed: > >> Do you have the post-7.4.2 datatype fixes for pg_autovacuum? > > > No. I'm still running 7.4.1 w/associated contrib. I guess an upgrade is in > > order then. I'm currently downloading 7.4.2 to see what the change is that I > > need. Is it just the 7.4.2 pg_autovacuum that is needed here? > > Nope, the fixes I was thinking about just missed the 7.4.2 release. > I think you can only get them from CVS. (Maybe we should offer a > nightly build of the latest stable release branch, not only development > tip...) > > > I've caught a whiff that 7.4.3 is nearing release? Any idea when? > > Not scheduled yet, but there was talk of pushing one out before 7.5 goes > into feature freeze. We need the temp table autovacuum fix before we do 7.4.3. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Wed, 2004-05-19 at 21:59, Robert Creager wrote: > When grilled further on (Wed, 19 May 2004 21:20:20 -0400 (EDT)), > Bruce Momjian <pgman@candle.pha.pa.us> confessed: > > > > > Did we ever come to a conclusion about excessive SMP context switching > > under load? > > > > I just figured out what was causing the problem on my system Monday. I'm using > the pg_autovacuum daemon, and it was not vacuuming my db. I've no idea why and > didn't get a chance to investigate. Strange. There is a known bug in the 7.4.2 version of pg_autovacuum related to data type mismatches, which is fixed in CVS. But that bug doesn't cause pg_autovacuum to stop vacuuming but rather to vacuum too often. So perhaps this is a different issue? Please let me know what you find. Thanks, Matthew O'Connor
In an attempt to throw the authorities off his trail, tgl@sss.pgh.pa.us (Tom Lane) transmitted: > ObQuote: "Research is what I am doing when I don't know what I am > doing." - attributed to Werner von Braun, but has anyone got a > definitive reference? <http://www.quotationspage.com/search.php3?Author=Wernher+von+Braun&file=other> That points to a bunch of seemingly authoritative sources... -- (reverse (concatenate 'string "moc.enworbbc" "@" "enworbbc")) http://www.ntlug.org/~cbbrowne/lsf.html "Terrrrrific." -- Ford Prefect
Guys, > Oh, you wanted a fix? That seems harder :-(. AFAICS we need a redesign > that causes less load on the BufMgrLock. FWIW, we've been pursuing two routes of quick patch fixes. 1) Dave Cramer and I have been testing setting varying rates of spin_delay in an effort to find a "sweet spot" that the individual system seems to like. This has been somewhat delayed by my illness. 2) The OSDL folks have been trying various patches to use Linux 2.6 Futexes in place of semops (if I have that right) which, if successful, would produce a Linux-specific fix. However, they haven't yet come up with a version of the patch which is stable. I'm really curious, BTW, about how all of Jan's changes to buffer usage in 7.5 affect this issue. Has anyone tested it on a recent snapshot? -- -Josh Berkus Aglio Database Solutions San Francisco
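[Editor's note: the spin_delay experiments Josh mentions amount to tuning how long a backend sleeps once it gives up spinning. Below is a rough, hypothetical sketch of that kind of spin-then-backoff loop, assuming randomized exponential delay via select(); the constants, names, and structure are made up for illustration and are not the actual s_lock.c code.]

/* Hypothetical sketch of a spin-then-backoff lock acquire; the real
 * logic lives in src/backend/storage/lmgr/s_lock.c. */
#include <stdlib.h>
#include <sys/select.h>

#define SPINS_PER_DELAY 100     /* illustrative value, not a real default */

static void
acquire_with_backoff(volatile int *lock)
{
    int spins = 0;
    int delay_us = 10;          /* starting delay; made-up number */

    while (__sync_lock_test_and_set(lock, 1) != 0)
    {
        if (++spins >= SPINS_PER_DELAY)
        {
            struct timeval tv;

            /* Sleep via select(); under heavy contention this is where
             * the context switches reported by vmstat would come from. */
            tv.tv_sec = 0;
            tv.tv_usec = delay_us + rand() % delay_us;  /* random backoff */
            (void) select(0, NULL, NULL, NULL, &tv);

            /* Grow the delay so sustained contention backs off harder. */
            if (delay_us < 1000)
                delay_us *= 2;
            spins = 0;
        }
    }
}

int
main(void)
{
    static volatile int lock = 0;   /* trivial single-threaded demo */

    acquire_with_backoff(&lock);
    __sync_lock_release(&lock);
    return 0;
}

Tuning SPINS_PER_DELAY and the delay growth in a loop like this is the kind of "sweet spot" search described above; a futex-based approach would instead let the kernel wake waiters directly rather than having them poll on a timer.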
Josh Berkus <josh@agliodbs.com> writes: > I'm really curious, BTW, about how all of Jan's changes to buffer > usage in 7.5 affect this issue. Has anyone tested it on a recent > snapshot? Won't help. (1) Theoretical argument: the problem case is select-only and touches few enough buffers that it need never visit the kernel. The buffer management algorithm is thus irrelevant since there are never any decisions for it to make. If anything CVS tip will have a worse problem because its more complicated management algorithm needs to spend longer holding the BufMgrLock. (2) Experimental argument: I believe that I did check the self-contained test case we eventually developed against CVS tip on one of Red Hat's SMP machines, and indeed it was unhappy. regards, tom lane