Thread: pg 8.1.3, AIX, huge box, painfully slow.
Bing-bong, passenger announcement.. the panic train is now pulling into platform 8.1.3. Bing-bong. =)

OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650 (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot up and our website is next to unusable. The IBM is not swapping (not with 16GB of RAM!), disk i/o is low, but there must be something critically wrong for this monster to be performing so badly..

There is little IO (maybe 500KB/sec), but the CPUs are often at 100% usage. VACUUM VERBOSE ANALYZE shows me 40000 page slots are needed to track all free space; I have 160000 page slots configured, and this machine is dedicated to pg.

The thing that really winds me up about this, aside from all the normal 'my postgres is slow l0lz!' troubleshooting, is that the previous machine (Debian sarge on four 3GHz Xeons) is also running 8.1.3, with an inferior I/O subsystem, and it churns through the workload very merrily, only reaching a full loadavg of 4 at peak times and running our main 'hotelsearch' function in ~1000ms. This IBM, on the other hand, is often taking 5-10 seconds to do the same thing - although just by watching the logs it's clear to see the workload coming in waves and then calming down again (this correlation is matched by watching the load-balancer's logs as it takes unresponsive webservers out of the cluster).

Here are the differences (I've removed obvious things like file/socket paths) in "select name,setting from pg_catalog.pg_settings" between the two:

--- cayenne   2006-04-07 18:43:48.000000000 +0100    # quad xeon
+++ jalapeno  2006-04-07 18:44:08.000000000 +0100    # ibm 650
- effective_cache_size        | 320000
+ effective_cache_size        | 640000
- integer_datetimes           | on
+ integer_datetimes           | off
- maintenance_work_mem        | 262144
+ maintenance_work_mem        | 1048576
- max_connections             | 150
+ max_connections             | 100
- max_fsm_pages               | 66000
+ max_fsm_pages               | 160000
- max_stack_depth             | 2048
+ max_stack_depth             | 16384
- tcp_keepalives_count        | 0
- tcp_keepalives_idle         | 0
- tcp_keepalives_interval     | 0
- temp_buffers                | 1000
- TimeZone                    | GB
+ tcp_keepalives_count        | 8
+ tcp_keepalives_idle         | 7200
+ tcp_keepalives_interval     | 75
+ temp_buffers                | 4000
+ TimeZone                    | GMT0BST,M3.5.0,M10.5.0
- wal_sync_method             | fdatasync
- work_mem                    | 4096
+ wal_sync_method             | open_datasync
+ work_mem                    | 16384

So, jalapeno really should have much more room to move. shared_buffers is 60000 on both machines.

I'm reaching the end of my tether here - our search functions are just so extensive, and my pg knowledge is so small, that it's overwhelming to try and step through them to find any bottlenecks :(

Just to reiterate, it all runs great on cayenne since we trimmed a lot of the fat out of the search, and I can't understand why the IBM box isn't absolutely throwing queries out the door :)

Cheers,
Gavin.
Gavin Hamill <gdh@laterooms.com> writes:
> OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> up and our website is next to unusable. The IBM is not swapping (not
> with 16GB of RAM!), disk i/o is low, but there must be something
> critically wrong for this monster to be performing so badly..

Have you vacuumed/analyzed since reloading your data?  Compare some
EXPLAIN ANALYZE outputs for identical queries on the two machines,
that usually helps figure out what's wrong.

			regards, tom lane
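[For reference, a minimal sketch of that comparison from a shell, assuming both boxes accept connections from the same workstation; the host and database names are taken from elsewhere in the thread, and query.sql is a placeholder for one of the real production statements wrapped in EXPLAIN ANALYZE:]

    # query.sql holds one representative statement prefixed with EXPLAIN ANALYZE
    psql -h cayenne  -d laterooms -f query.sql > plan-cayenne.txt
    psql -h jalapeno -d laterooms -f query.sql > plan-jalapeno.txt
    diff -u plan-cayenne.txt plan-jalapeno.txt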
On Fri, 2006-04-07 at 12:58, Gavin Hamill wrote:
> Bing-bong, passenger announcement.. the panic train is now pulling into
> platform 8.1.3. Bing-bong. =)
>
> OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> up and our website is next to unusable. The IBM is not swapping (not
> with 16GB of RAM!), disk i/o is low, but there must be something
> critically wrong for this monster to be performing so badly..
>
> There is little IO (maybe 500KB/sec), but the CPUs are often at 100%
> usage.

Can you test your AIX box with linux on it?  It may well be that
something in AIX is causing this performance problem.  I know that on
the same SPARC hardware, a postgresql database is 2 or more times faster
on top of linux or BSD than it is on solaris, at least it was back a few
years ago when I tested it.

Are the same queries getting the same basic execution plan on both
boxes?  Turn on logging for slow queries, and explain analyze them on
both machines to see if they are.  If they aren't, figure out why.

I'd put the old 4 way Xeon back in production and do some serious
testing of this pSeries machine.  IBM should be willing to help you, I
hope.

My guess is that this is an OS issue.  Maybe there are AIX tweaks that
will get it up to the same or higher level of performance as your four
way xeon.  Maybe there aren't.

Myself, I'd throw a spare drive in for the OS, put some flavor of linux
on it

http://www-1.ibm.com/partnerworld/pwhome.nsf/weblook/pat_linux_learn_why_power.html

and do some load testing there.  If the machine can't perform up to
snuff with the same basic OS and a similar setup to your Xeon, send it
back to IBM and buy one of these:

http://www.asaservers.com/system_dept.asp?dept_id=SD-002

or something similar.  I can't imagine it costing more than a pSeries.
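[A hedged sketch of the slow-query logging Scott suggests, for 8.1; the threshold, data directory path and log_line_prefix format are assumptions:]

    # append to postgresql.conf and signal a reload (both settings take effect on SIGHUP)
    echo "log_min_duration_statement = 1000    # log any statement taking >= 1s" \
        >> /usr/local/pgsql/data/postgresql.conf
    echo "log_line_prefix = '%t [%p] '         # timestamps + PIDs ease comparing the two logs" \
        >> /usr/local/pgsql/data/postgresql.conf
    pg_ctl -D /usr/local/pgsql/data reload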
On Fri, 07 Apr 2006 14:41:39 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Gavin Hamill <gdh@laterooms.com> writes:
> > OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> > (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> > up and our website is next to unusable. The IBM is not swapping (not
> > with 16GB of RAM!), disk i/o is low, but there must be something
> > critically wrong for this monster to be performing so badly..
>
> Have you vacuumed/analyzed since reloading your data?

Absolutely - a VACUUM FULL was the first thing I did, and have VACUUM
ANALYZE VERBOSE'd a couple of times since. I have plenty of overhead to
keep the entire free space map in RAM.

> Compare some
> EXPLAIN ANALYZE outputs for identical queries on the two machines,
> that usually helps figure out what's wrong.

If only :) Since 90% of the db work is the 'hotelsearch' function (which
is 350 lines-worth that I'm not permitted to share :(( ), an EXPLAIN
ANALYZE reveals practically nothing:

##### jalapeno (IBM)

laterooms=# EXPLAIN ANALYZE select * from hotelsearch(12.48333::numeric, 41.90000::numeric, 5::int4, '2006-04-13'::date, 5::int4, NULL::int4, 1::int4, NULL::int4, NULL::int4, TRUE::bool, FALSE::bool, FALSE::bool, 1::int2, 'GBP'::text, 'Roma'::text, 7::int4, NULL::int4, NULL::int4) limit 500;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.25 rows=500 width=1587) (actual time=2922.282..2922.908 rows=255 loops=1)
   ->  Function Scan on hotelsearch  (cost=0.00..12.50 rows=1000 width=1587) (actual time=2922.277..2922.494 rows=255 loops=1)
 Total runtime: 2923.296 ms
(3 rows)

##### cayenne (Xeon)

laterooms=# EXPLAIN ANALYZE select * from hotelsearch(12.48333::numeric, 41.90000::numeric, 5::int4, '2006-04-13'::date, 5::int4, NULL::int4, 1::int4, NULL::int4, NULL::int4, TRUE::bool, FALSE::bool, FALSE::bool, 1::int2, 'GBP'::text, 'Roma'::text, 7::int4, NULL::int4, NULL::int4) limit 500;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.25 rows=500 width=1587) (actual time=1929.483..1930.103 rows=255 loops=1)
   ->  Function Scan on hotelsearch  (cost=0.00..12.50 rows=1000 width=1587) (actual time=1929.479..1929.693 rows=255 loops=1)
 Total runtime: 1930.506 ms
(3 rows)

The 'LIMIT 500' is a red herring since the function body will get all
data, so reducing the LIMIT in the call to hotelsearch doesn't reduce
the amount of work being done.

The killer in it all, tail'ing the postgres log (which I have set only
to log queries at 1000ms or up), is that things will be returning at
1000-2000ms.. then suddenly shoot up to 8000ms.. and if I try a few of
those 8000ms queries on the xeon box, they exec in ~1500ms.. and if I
try them again a few moments later on the ibm, they'll also exec in
maybe ~2500ms.

This is one hell of a moving target and I can't help but think I'm just
missing something that's right in front of my nose, too close to see.

Cheers,
Gavin.
On Fri, 07 Apr 2006 13:54:21 -0500
Scott Marlowe <smarlowe@g2switchworks.com> wrote:

> Are the same queries getting the same basic execution plan on both
> boxes?  Turn on logging for slow queries, and explain analyze them on
> both machines to see if they are.

See reply to Tom Lane :)

> I'd put the old 4 way Xeon back in production and do some serious
> testing of this pSeries machine.  IBM should be willing to help you, I
> hope.

They probably would if this had been bought new - as it is, we have
rented the machine for a month from a 2nd-user dealer to see if it's
capable of taking the load. I'm now glad we did this.

> My guess is that this is an OS issue.  Maybe there are AIX tweaks that
> will get it up to the same or higher level of performance as your four
> way xeon.  Maybe there aren't.

The pSeries isn't much older than our Xeon machine, and I expected the
performance level to be exemplary out of the box.. we've enabled the
64-bit kernel+userspace, and compiled pg for 64-bitness with the gcc
flags as recommended by Senica Cunningham on this very list..

> Myself, I'd throw a spare drive in for the OS, put some flavor of
> linux on it

Terrifying given I know nothing about the pSeries boot system, but at
this stage I'm game for nearly anything.

> http://www.asaservers.com/system_dept.asp?dept_id=SD-002

Multi-Opteron was the other thing we considered but decided to give
'Big Iron' UNIX a whirl...

Cheers,
Gavin.
Gavin Hamill <gdh@laterooms.com> writes:
> Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>> My guess is that this is an OS issue.  Maybe there are AIX tweaks that
>> will get it up to the same or higher level of performance as your four
>> way xeon.  Maybe there aren't.

> The pSeries isn't much older than our Xeon machine, and I expected the
> performance level to be exemplary out of the box..

I'm fairly surprised too.  One thing I note from your comparison of
settings is that the default WAL sync method is different on the two
operating systems.  If the query load is update-heavy then it would be
very worth your while to experiment with the sync method.  However, if
the bottleneck is pure-SELECT transactions then WAL sync should not be a
factor at all.

Does AIX have anything comparable to oprofile or dtrace?  It'd be
interesting to try to monitor things at that level and see what we can
learn.  Failing a low-level profiler, there should at least be something
comparable to strace --- you should try watching some of the backends
with strace and see what their behavior is when the performance goes
south.  Lots of delaying select()s or semop()s would be a red flag.

			regards, tom lane
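[A minimal sketch of the strace approach Tom describes, on the Linux box; on AIX the rough equivalent is truss, as used later in the thread. The PID is a placeholder for a busy backend picked from top/ps:]

    # stream timestamped syscalls from one backend, watching for select()/semop() storms
    strace -tt -T -p <backend_pid> 2>&1 | egrep 'select|semop'
    # or just tally syscall counts and time spent, then Ctrl-C after ~30 seconds
    strace -c -p <backend_pid>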
On Fri, 7 Apr 2006 20:59:19 +0100
Gavin Hamill <gdh@laterooms.com> wrote:

> > I'd put the old 4 way Xeon back in production and do some serious
> > testing of this pSeries machine.  IBM should be willing to help you, I
> > hope.
>
> They probably would if this had been bought new - as it is, we have
> rented the machine for a month from a 2nd-user dealer to see if it's
> capable of taking the load. I'm now glad we did this.

We also had problems with a high end AIX system and we got no help from
IBM.  They expected you to put Oracle on, and if you used anything else
you were on your own.

We had exactly the same issue.  We expected to get an order of magnitude
improvement and instead the app bogged down.  It also got worse over
time.  We had to reboot every night to get anything out of it.  Needless
to say, they got their system back.

> > My guess is that this is an OS issue.  Maybe there are AIX tweaks that
> > will get it up to the same or higher level of performance as your four
> > way xeon.  Maybe there aren't.
>
> The pSeries isn't much older than our Xeon machine, and I expected the
> performance level to be exemplary out of the box.. we've enabled the
> 64-bit kernel+userspace, and compiled pg for 64-bitness with the gcc
> flags as recommended by Senica Cunningham on this very list..

That's Seneca.

We found that our money was better spent on multiple servers running
NetBSD with a home grown multi-master replication system.  Need more
power?  Just add more servers.

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
On Fri, 2006-04-07 at 14:59, Gavin Hamill wrote:
> On Fri, 07 Apr 2006 13:54:21 -0500
> Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>
> > Are the same queries getting the same basic execution plan on both
> > boxes?  Turn on logging for slow queries, and explain analyze them on
> > both machines to see if they are.
>
> See reply to Tom Lane :)

I didn't see one go by yet...  Could be sitting in the queue.

> They probably would if this had been bought new - as it is, we have
> rented the machine for a month from a 2nd-user dealer to see if it's
> capable of taking the load. I'm now glad we did this.

Thank god.  I had a picture of you sitting on top of a brand new, very
expensive pSeries.

Let us know if changing the fsync setting helps.  Hopefully that's all
the problem is.

Off on a tangent.  If the aggregate memory bandwidth of the pSeries is
no greater than your Xeon you might not see a big improvement if you
were memory bound before.  If you were CPU bound, you may or may not
see an improvement.

Can you describe the disc subsystems in the two machines for us?  What
kind of read / write load do you have?  It could be the older box was
running on IDE drives with fake fsync responses which would lie, be
fast, but not reliable in case of a power outage.  Do you have hardware
RAID for your pSeries?  How many discs, how much battery backed cache,
etc?

> Multi-Opteron was the other thing we considered but decided to give
> 'Big Iron' UNIX a whirl...

It still might be a good choice, if it's a simple misconfiguration
issue.

But man, those new multiple core opterons can make some impressive
machines for very little money.
Gavin Hamill <gdh@laterooms.com> writes:

> This is one hell of a moving target and I can't help but think I'm just
> missing something that's right in front of my nose, too close to see.

I'm assuming you compiled postgres yourself?  Do you have the output
from the configure script?  I'm wondering if it failed to find a good
spinlock match for the architecture.  Not sure if that's really likely
but it's a possibility.

Also, I'm pretty sure IBM has tools that would let you disable some of
the processors to see if maybe it's a shared memory bus issue.

If I understand you right the machine isn't in production yet?  In which
case I would get timing information for a single processor, two
processors, four processors, and eight processors.  If you see it max
out and start dropping then that would point towards a hardware/low
level postgres issue like spinlocks or shared memory rather than a high
level database issue like stats.

-- 
greg
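[One hedged way to check the spinlock question after the fact, without the original configure output, is to look at the generated pg_config.h in the build tree; the path below assumes a source build:]

    # HAVE_SPINLOCKS should be defined to 1; if it is missing, the build fell
    # back to the much slower semaphore-based spinlock emulation.
    grep HAVE_SPINLOCKS src/include/pg_config.h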
Gavin Hamill wrote:
> Bing-bong, passenger announcement.. the panic train is now pulling into
> platform 8.1.3. Bing-bong. =)
>
> OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> up and our website is next to unusable. The IBM is not swapping (not

I would say running _one_ query at a time depends on the power of _one_
CPU.  PPCs aren't that fast; I'd say they are slower than Xeons.
Moreover, I'm sure that AMD Opterons are faster than Xeons.  I'd say you
should go and test an Opteron-based configuration.  You'll get much more
power for the same (quite likely for less) money.

-- 
Regards,
Gábriel Ákos
-=E-Mail :akos.gabriel@i-logic.hu|Web: http://www.i-logic.hu=-
-=Tel/fax:+3612367353            |Mobil:+36209278894        =-
On Fri, 7 Apr 2006 16:16:02 -0400
"D'Arcy J.M. Cain" <darcy@druid.net> wrote:

> We also had problems with a high end AIX system and we got no help
> from IBM.  They expected you to put Oracle on, and if you used
> anything else you were on your own.

Urk, I thought IBM were supposedly Linux sycophants these days...

> We had exactly the same issue.
> We expected to get an order of magnitude improvement and instead the
> app bogged down.

That's kind of encouraging, I suppose - that it might not be something
mind-bogglingly stupid I'm doing.

> It also got worse over time.  We had to reboot every
> night to get anything out of it.  Needless to say, they got their
> system back.

<nod>

> That's Seneca.

Oops - meant to check the spelling before I sent that =)

> We found that our money was better spent on multiple servers running
> NetBSD with a home grown multi-master replication system.  Need more
> power?  Just add more servers.

Aye, I originally suggested multiple servers, but was talked round to
one giant db so that our devels didn't have to rewrite code to deal
with read/write + read-only db handles...

Cheers,
Gavin.
On Fri, 07 Apr 2006 15:24:18 -0500
Scott Marlowe <smarlowe@g2switchworks.com> wrote:

> > See reply to Tom Lane :)
>
> I didn't see one go by yet...  Could be sitting in the queue.

If it's not arrived by now - EXPLAIN ANALYZE doesn't tell me
anything :)

> Let us know if changing the fsync setting helps.  Hopefully that's all
> the problem is.

fsync's already off - yes a bit scary, but our I/O is only about
500KB/sec writing.. the whole db fits in RAM / kernel disk cache, and
I'd rather have performance than security at this exact moment..

> Off on a tangent.  If the aggregate memory bandwidth of the pSeries is
> no greater than your Xeon you might not see a big improvement if you
> were memory bound before.  If you were CPU bound, you may or may not
> see an improvement.

I did look into the specs of the system, and the memory bw on the
pSeries was /much/ greater than the Xeon - it's one of the things that
really pushed me towards it in the end. I forget the figures, but it
was 3 or 4 times greater.

> Can you describe the disc subsystems in the two machines for us?  What
> kind of read / write load do you have?  It could be the older box was
> running on IDE drives with fake fsync responses which would lie, be
> fast, but not reliable in case of a power outage.

Again, I'm confident that I/O's not the killer here.. the Xeon is a Dell
6850 - hardware RAID1, SCSI drives.

> > Multi-Opteron was the other thing we considered but decided to give
> > 'Big Iron' UNIX a whirl...
>
> It still might be a good choice, if it's a simple misconfiguration
> issue.
>
> But man, those new multiple core opterons can make some impressive
> machines for very little money.

So I see - we could buy two quad-opterons for the cost of renting this
pSeries for a month....

Cheers,
Gavin.
On Fri, 07 Apr 2006 16:06:02 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > The pSeries isn't much older than our Xeon machine, and I expected
> > the performance level to be exemplary out of the box..
>
> I'm fairly surprised too.  One thing I note from your comparison of
> settings is that the default WAL sync method is different on the two
> operating systems.

We're very read-focussed.. there's update activity, sure, but the IO is
only pushing about 500KByte/sec on average, usually much less. I also
have fsync switched off - yes, dangerous, but I just want to eliminate
IO completely as a contributing factor.

> Does AIX have anything comparable to oprofile or dtrace?

I've used neither on Linux, but a quick google showed up a few articles
along the lines of 'in theory it shouldn't be hard to port to AIX....'
but nothing concrete. My guess is IBM sell a tool to do this. Hell, the
C++ compiler is £1200... (hence our use of GCC 4.1 to compile pg)

> Failing a low-level profiler, there should at least be
> something comparable to strace --- you should try watching some of
> the backends with strace and see what their behavior is when the
> performance goes south.  Lots of delaying select()s or semop()s would
> be a red flag.

There's truss installed which seems to do the same as strace on Linux...
and here's a wildly non-scientific glance.. I watched the 'topas' output
(top for AIX), identified a PID that was doing a lot of work, then
attached truss to that pid. In addition to lots of send(), recv() and
lseek()s... about once a minute I saw hundreds of calls to __semop()
interspersed with _select(), followed by tons of lseek()+kread()+__semop()
and then I can see the kwrite() to the pg logfile:

246170: kwrite(2, " L O G :   d u", 8)          = 8

etc.

Cheers,
Gavin.
Gavin,
On 4/7/06 2:24 PM, "Gavin Hamill" <gdh@laterooms.com> wrote:
> I did look into the specs of the system, and the memory bw on the
> pSeries was /much/ greater than the Xeon - it's one of the things that
> really pushed me towards it in the end. I forget the figures, but it
> was 3 or 4 times greater.
From the literature at:
http://www-03.ibm.com/servers/eserver/pseries/hardware/midrange/p650_desc.html
“The pSeries 650 features a peak aggregate memory to L3 cache bandwidth of 25.6GB/second for an 8way configuration. In addition, aggregate I/O bandwidth is up to 16GB/second. The result is a remarkable combination of system architecture, speed and power that delivers efficient and cost-effective data sharing and application throughput.”
That’s a total of 25.6GB/s for 8 CPUs, or 3.2GB/s per CPU. 3GHz P4 Xeons typically have an 800MHz memory bus that is 8 bytes wide, which works out to double that figure: 6.4GB/s (800MHz x 8 bytes = 6.4GB/s). Furthermore, the speed at which the P4 Xeon can load data from memory into L2 cache is matched to the bus, so it can stream data into L2 at full bus speed.
That said, I find typical memory bandwidth for the P4 in applications is limited at about 2GB/s. See here for more detail: http://www.cs.virginia.edu/stream/standard/Bandwidth.html
In fact, looking at the results there, the IBM 650m2 only gets 6GB/s on all 8 CPUs. I wouldn’t be surprised if the strange L3 cache architecture of the IBM 650 is holding it back from streaming memory access efficiently.
Whether this has anything to do with your problem or not, I have no idea!
- Luke
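[If it helps to quantify Luke's bandwidth point, a rough sketch of running the STREAM benchmark he references on both boxes; the source file name and compiler flags are assumptions, and the C source is available from the STREAM site linked above:]

    gcc -O2 -o stream stream.c    # stream.c fetched from the STREAM project site
    ./stream                      # reports sustainable Copy/Scale/Add/Triad bandwidth in MB/s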
Gavin Hamill <gdh@laterooms.com> writes:
> There's truss installed which seems to do the same as strace on
> Linux... and here's a wildly non-scientific glance.. I watched the
> 'topas' output (top for AIX), identified a PID that was doing a lot of
> work, then attached truss to that pid. In addition to lots of
> send(), recv() and lseek()s...

Those are good, they represent real work getting done.

> about once a minute I saw hundreds of calls
> to __semop() interspersed with _select(),

This is not good.  Did the semop storms coincide with visible slowdown?
(I'd assume so, but you didn't actually say...)

			regards, tom lane
"Luke Lonergan" <llonergan@greenplum.com> writes: > That said, I find typical memory bandwidth for the P4 in applications is > limited at about 2GB/s. See here for more detail: > http://www.cs.virginia.edu/stream/standard/Bandwidth.html > In fact, looking at the results there, the IBM 650m2 only gets 6GB/s > on all 8 CPUs. I wouldn't be surprised if the strange L3 cache > architecture of the IBM 650 is holding it back from streaming memory > access efficiently. Given Gavin's latest report, I'm wondering how much the IBM slows down when a spinlock operation is involved. If the memory architecture isn't good about supporting serialized access to memory, that gaudy sounding bandwidth number might have little to do with PG's real-world behavior. On the other hand, we already know that Xeons suck about as badly as can be on that same measure; could the pSeries really be worse? regards, tom lane
Tom,

On 4/7/06 3:02 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> On the other hand, we already know that Xeons suck about as badly as
> can be on that same measure; could the pSeries really be worse?

I wouldn't be too surprised, but it sounds like it needs a test.  Do we
have a test for this?  Is there a contention-prone query stream that we
can think up?

- Luke
"Luke Lonergan" <llonergan@greenplum.com> writes: > On 4/7/06 3:02 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: >> On the other hand, we already know that Xeons suck about as badly as >> can be on that same measure; could the pSeries really be worse? > I wouldn't be too surprised, but it sounds like it needs a test. Do we have > a test for this? Is there a contention-prone query stream that we can think > up? If you want you could install a pre-8.1 PG and then try one of the queries that we were using as test cases a year ago for spinlock investigations. I don't recall details right now but I remember having posted a pretty trivial test case that would send a multiprocessor machine into context-swap storm, which sounds a whole lot like what Gavin is seeing. I think that 8.1 ought to be relatively free of buffer-manager spinlock contention, which is why I doubt that test case would be interesting against 8.1. The interesting question is what else is he seeing contention for, if it's not the BufMgrLock? regards, tom lane
On Fri, 07 Apr 2006 17:56:49 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> This is not good.  Did the semop storms coincide with visible
> slowdown?  (I'd assume so, but you didn't actually say...)

If I'd been able to tell, then I'd tell you =) I'll have another go...

Yes, there's a definite correlation here.. I attached truss to the main
postmaster..

$ truss -Ff -p 340344 2>&1 | grep semop

here's a snippet:

278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1) = 0
155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1) = 0
278774: __semop(15728649, 0x0FFFFFFFFFFF6F10, 1)
114914: __semop(15728649, 0x0FFFFFFFFFFF6A40, 1) = 0 = 0
114914: __semop(15728650, 0x0FFFFFFFFFFF61E0, 1)
155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1) = 0 = 0
155712: __semop(15728650, 0x0FFFFFFFFFFF6890, 1) = 0
155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1)
278774: __semop(15728650, 0x0FFFFFFFFFFF6F10, 1)
155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1) = 0 = 0
278774: __semop(15728649, 0x0FFFFFFFFFFF7E40, 1)
114914: __semop(15728649, 0x0FFFFFFFFFFF6A80, 1) = 0 = 0
278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1)

And when I saw a flood of semop's for any particular PID, a second later
the 'topas' process list would show that PID at 100% CPU...

Most intriguing :)

Cheers,
Gavin.
Gavin Hamill <gdh@laterooms.com> writes:
> On Fri, 07 Apr 2006 17:56:49 -0400
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is not good.  Did the semop storms coincide with visible
>> slowdown?  (I'd assume so, but you didn't actually say...)

> Yes, there's a definite correlation here.. I attached truss to the
> main postmaster..
> ...
> And when I saw a flood of semop's for any particular PID, a second later
> the 'topas' process list would show that PID at 100% CPU...

So apparently we've still got a problem with multiprocess contention for
an LWLock somewhere.  It's not the BufMgrLock because that's gone in
8.1.  It could be one of the finer-grain locks that are still there, or
it could be someplace else.

Are you in a position to try your workload using PG CVS tip?  There's
a nontrivial possibility that we've already fixed this --- a couple
months ago I did some work to reduce contention in the lock manager:

2005-12-11 16:02  tgl

	* src/: backend/access/transam/twophase.c,
	backend/storage/ipc/procarray.c, backend/storage/lmgr/README,
	backend/storage/lmgr/deadlock.c, backend/storage/lmgr/lock.c,
	backend/storage/lmgr/lwlock.c, backend/storage/lmgr/proc.c,
	include/storage/lock.h, include/storage/lwlock.h,
	include/storage/proc.h: Divide the lock manager's shared state
	into 'partitions', so as to reduce contention for the former
	single LockMgrLock.  Per my recent proposal.  I set it up for 16
	partitions, but on a pgbench test this gives only a marginal
	further improvement over 4 partitions --- we need to test more
	scenarios to choose the number of partitions.

This is unfortunately not going to help you as far as getting that
machine into production now (unless you're brave enough to run CVS tip
as production, which I certainly am not).  I'm afraid you're most likely
going to have to ship that pSeries back at the end of the month, but
while you've got it it'd be awfully nice if we could use it as a
testbed ...

			regards, tom lane
Gavin,

On 4/7/06 3:27 PM, "Gavin Hamill" <gdh@laterooms.com> wrote:

> 278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1) = 0
> 155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1) = 0
> 278774: __semop(15728649, 0x0FFFFFFFFFFF6F10, 1)
> 114914: __semop(15728649, 0x0FFFFFFFFFFF6A40, 1) = 0 = 0
> 114914: __semop(15728650, 0x0FFFFFFFFFFF61E0, 1)
> 155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1) = 0 = 0
> 155712: __semop(15728650, 0x0FFFFFFFFFFF6890, 1) = 0
> 155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1)
> 278774: __semop(15728650, 0x0FFFFFFFFFFF6F10, 1)
> 155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1) = 0 = 0
> 278774: __semop(15728649, 0x0FFFFFFFFFFF7E40, 1)
> 114914: __semop(15728649, 0x0FFFFFFFFFFF6A80, 1) = 0 = 0
> 278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1)

Seems like you're hitting a very small target in RAM with these semop
calls.  I wonder what part of the code is doing this - Tom would know
better how to trace it, but the equivalent of oprofile output would be
nice.

The other thing that I'd like to see is an evaluation of the memory
access latency of this machine from Register to RAM.  I couldn't find a
benchmarking tool that was UNIX friendly out there, maybe I'll write one
real quick.  I suspect this machine has a heinous latency and a storm of
semops to the same spot of RAM might be a far worse performance problem
on this machine than on others...

- Luke
"Luke Lonergan" <llonergan@greenplum.com> writes: > On 4/7/06 3:27 PM, "Gavin Hamill" <gdh@laterooms.com> wrote: >> 278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1) = 0 >> 155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1) = 0 >> 278774: __semop(15728649, 0x0FFFFFFFFFFF6F10, 1) > Seems like you're hitting a very small target in RAM with these semop calls. IIRC the address passed to semop() in our code is always a local struct on the stack, so that's a bit of a red herring --- there won't be cross-processor contention for that. It's plausible though that we are seeing contention across members of the LWLock array, with the semop storm just being a higher-level symptom of the real hardware-level problem. You might try increasing LWLOCK_PADDED_SIZE to 64 or even 128, see src/backend/storage/lmgr/lwlock.c (this is something that does exist in 8.1, so it'd be easy to try). regards, tom lane
On Fri, 07 Apr 2006 18:52:20 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Are you in a position to try your workload using PG CVS tip?  There's
> a nontrivial possibility that we've already fixed this --- a couple
> months ago I did some work to reduce contention in the lock manager:

Well, there's a question. At the moment it's still live - but I'll need
to swap back to the Xeon machine since I can't afford to have a Saturday
with the db firing on three cylinders (out of eight :)

At that point you're welcome to twiddle, compile, throw anything you
want at it. If it helps us as much as the greater pg world, then that's
perfect.

> This is unfortunately not going to help you as far as getting that
> machine into production now (unless you're brave enough to run CVS tip
> as production, which I certainly am not).

.. if the problem can actually be boiled down to the locking/threading
issues, surely it should be straightforward to backport those changes to
8.1.3 mainline?

> I'm afraid you're most
> likely going to have to ship that pSeries back at the end of the
> month, but while you've got it it'd be awfully nice if we could use
> it as a testbed ...

We have it for the next 2 weeks, and whilst I can't guarantee access for
all that time, you're welcome to hammer away at it over this weekend if
that's any help? Mail me privately and I'll sort out login details if
this is interesting.

Cheers,
Gavin.
On Fri, 07 Apr 2006 15:56:52 -0700
"Luke Lonergan" <llonergan@greenplum.com> wrote:

> Seems like you're hitting a very small target in RAM with these semop
> calls.  I wonder what part of the code is doing this - Tom would know
> better how to trace it, but the equivalent of oprofile output would
> be nice.

I'm happy to test whatever I can, but I simply don't know enough AIX to
be able to tell whether a similar kernel-level profiler is
available/possible.

> The other thing that I'd like to see is an evaluation of the memory
> access latency of this machine from Register to RAM.  I couldn't find
> a benchmarking tool that was UNIX friendly out there, maybe I'll
> write one real quick.  I suspect this machine has a heinous latency
> and a storm of semops to the same spot of RAM might be a far worse
> performance problem on this machine than on others...

Well, as I said to Tom, the machine is available for running tests on :)
If it helps us, and helps pg become more AIX friendly, then I'm all for
whatever needs done...

Cheers,
Gavin.
Gavin Hamill wrote:
> On Fri, 07 Apr 2006 15:24:18 -0500
> Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>
>>> See reply to Tom Lane :)
>>
>> I didn't see one go by yet...  Could be sitting in the queue.
>
> If it's not arrived by now - EXPLAIN ANALYZE doesn't tell me
> anything :)
>
>> Let us know if changing the fsync setting helps.  Hopefully that's all
>> the problem is.
>
> fsync's already off - yes a bit scary, but our I/O is only about
> 500KB/sec writing.. the whole db fits in RAM / kernel disk cache, and
> I'd rather have performance than security at this exact moment..
>
>>> Multi-Opteron was the other thing we considered but decided to give
>>> 'Big Iron' UNIX a whirl...
>>
>> It still might be a good choice, if it's a simple misconfiguration
>> issue.
>>
>> But man, those new multiple core opterons can make some impressive
>> machines for very little money.
>
> So I see - we could buy two quad-opterons for the cost of renting this
> pSeries for a month....

I don't know about the pSeries, but I had a client the other week with
the same usage pattern. They made the switch from a quad Xeon to dual
(+dual-core) Opteron and were extremely impressed. I can probably put
you in touch with their sysadmin - contact me off-list if you'd like
that.

-- 
  Richard Huxton
  Archonet Ltd
On Fri, 2006-04-07 at 19:05 -0400, Tom Lane wrote:
> It's plausible though that we are seeing contention across members of
> the LWLock array, with the semop storm just being a higher-level symptom
> of the real hardware-level problem.  You might try increasing
> LWLOCK_PADDED_SIZE to 64 or even 128, see
> src/backend/storage/lmgr/lwlock.c (this is something that does exist in
> 8.1, so it'd be easy to try).

pSeries cache lines are 128 bytes wide, so I'd go straight to 128.

If you're renting all 8 CPUs, I'd drop to 4 and try that instead. With 8
CPUs the contention will vary according to what each CPU is doing at any
one time - when they all hit the contention spot, things will get worse.

The pSeries has good CPUs and great caching, so I'd expect contention to
be somewhat more apparent as a bottleneck.

Best Regards, Simon Riggs
Simon Riggs wrote:
> pSeries cache lines are 128 bytes wide, so I'd go straight to 128.

Hello :)

OK, that line of code is:

#define LWLOCK_PADDED_SIZE  (sizeof(LWLock) <= 16 ? 16 : 32)

What should I change this to? I don't understand the syntax of the
<= 16 ? : stuff... would a simple "#define LWLOCK_PADDED_SIZE 128" be
sufficient?

> If you're renting all 8 CPUs, I'd drop to 4 and try that instead. With 8
> CPUs the contention will vary according to what each CPU is doing at any
> one time - when they all hit the contention spot, things will get worse.

We have a physical machine installed in our rack at the data centre,
rather than renting a virtual partition of a real machine... I'm not
sure how to enable/disable CPUs even with the help of 'smitty' :)

> The pSeries has good CPUs and great caching, so I'd expect contention to
> be somewhat more apparent as a bottleneck.

Yep, I expected 32MB of 'L3' cache would yield impressive results :)

Cheers,
Gavin.
Gavin Hamill <gdh@laterooms.com> writes:
> would a simple "#define LWLOCK_PADDED_SIZE 128" be sufficient?

Yeah, that's fine.

			regards, tom lane
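[For completeness, a sketch of applying that one-line change and rebuilding; the sed invocation assumes GNU sed and a configured 8.1 source tree - editing src/backend/storage/lmgr/lwlock.c by hand is just as good:]

    # replace the existing LWLOCK_PADDED_SIZE define with a flat 128 bytes
    sed -i 's/^#define LWLOCK_PADDED_SIZE.*/#define LWLOCK_PADDED_SIZE 128/' \
        src/backend/storage/lmgr/lwlock.c
    make && make install    # then restart the postmaster to pick up the new binary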
Tom Lane wrote:
> This is unfortunately not going to help you as far as getting that
> machine into production now (unless you're brave enough to run CVS tip
> as production, which I certainly am not).  I'm afraid you're most likely
> going to have to ship that pSeries back at the end of the month, but
> while you've got it it'd be awfully nice if we could use it as a testbed

We have pSeries boxes here that won't be going away anytime soon. If
there are any specific test cases that need to run, I should be able to
find the time to do it.

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
Tom Lane wrote:
> Gavin Hamill <gdh@laterooms.com> writes:
> > would a simple "#define LWLOCK_PADDED_SIZE 128" be sufficient?
>
> Yeah, that's fine.

OK, I tried that but noticed no real improvement... in the interim I've
installed Debian on the pSeries (using
http://debian.gonicus.de/debian/dists/sarge/main/disks-powerpc/current/pseries/install.txt )
and am using a simple load-test script - it picks a 'hotelsearch' select
at random from a big file and just does a pg_query on that via PHP...

Using apachebench with 10 clients gave a loadavg of about 10 after a few
minutes, and the logs showed typical query times of 8 seconds. Again, no
disk activity, normal context-switching, just full-out CPU usage...

We're improving the quality + efficiency of the hotelsearch function all
the time (Simon will certainly be able to vouch for its complexity) -
I'm really uncertain what to do next tho! :/

Cheers,
Gavin.
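[Roughly the load test Gavin describes, for anyone wanting to reproduce it; the URL is a placeholder for the PHP page that picks a random hotelsearch statement and runs it via pg_query, and the request counts are arbitrary:]

    # 10 concurrent clients, 2000 requests total, against the test page on the pSeries
    ab -c 10 -n 2000 http://jalapeno.example.com/loadtest.php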