Thread: pg 8.1.3, AIX, huge box, painfully slow.

pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
Bing-bong, passenger announcement.. the panic train is now pulling into
platform 8.1.3. Bing-bong. =)

OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
(8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
up and our website is next to unusable. The IBM is not swapping (not
with 16GB of RAM!), disk i/o is low, but there must be something
critically wrong for this monster to be performing so badly..

There is little IO (maybe 500KB/sec), but the CPUs are often at 100%
usage.

VACUUM VERBOSE ANALYZE shows me 40000 page slots are needed to track
all free space. I have 160000 page slots configured, and this machine is
dedicated to pg.

The thing that really winds me up about this, is that aside from all
the normal 'my postgres is slow l0lz!' troubleshooting is the previous
machine (Debian sarge on four 3GHz Xeons) is using 8.1.3 also, with an
inferior I/O subsystem, and it churns through the workload very
merrily, only reaching a full loadavg of 4 at peak times, and running
our main 'hotelsearch' function in ~1000ms..

This IBM on the other hand is often taking 5-10 seconds to do the same
thing - although just by watching the logs it's clear to see the
workload coming in waves, and then calming down again. (this
correlation is matched by watching the load-balancer's logs as it takes
unresponsive webservers out of the cluster)

Here's the differences (I've removed obvious things like file/socket
paths) in "select name,setting from pg_catalog.pg_settings" between the
two:

--- cayenne     2006-04-07 18:43:48.000000000 +0100 # quad xeon
+++ jalapeno    2006-04-07 18:44:08.000000000 +0100 # ibm 650
- effective_cache_size            | 320000
+ effective_cache_size            | 640000
- integer_datetimes               | on
+ integer_datetimes               | off
- maintenance_work_mem            | 262144
+ maintenance_work_mem            | 1048576
- max_connections                 | 150
+ max_connections                 | 100
- max_fsm_pages                   | 66000
+ max_fsm_pages                   | 160000
- max_stack_depth                 | 2048
+ max_stack_depth                 | 16384
- tcp_keepalives_count            | 0
- tcp_keepalives_idle             | 0
- tcp_keepalives_interval         | 0
- temp_buffers                    | 1000
- TimeZone                        | GB
+ tcp_keepalives_count            | 8
+ tcp_keepalives_idle             | 7200
+ tcp_keepalives_interval         | 75
+ temp_buffers                    | 4000
+ TimeZone                        | GMT0BST,M3.5.0,M10.5.0
- wal_sync_method                 | fdatasync
- work_mem                        | 4096
+ wal_sync_method                 | open_datasync
+ work_mem                        | 16384

So, jalapeno really should have much more room to move. shared_buffers
is 60000 on both machines.

I'm reaching the end of my tether here - our search functions are just
so extensive and my pg knowledge is so small that it's overwhelming to
try and step through it to find any bottlenecks :(

Just to reiterate, it all runs great on cayenne since we trimmed a lot
of the fat out of the search, and I can't understand why the IBM box
isn't absolutely throwing queries out the door :)

Cheers,
Gavin.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
Gavin Hamill <gdh@laterooms.com> writes:
> OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> up and our website is next to unusable. The IBM is not swapping (not
> with 16GB of RAM!), disk i/o is low, but there must be something
> critically wrong for this monster to be performing so badly..

Have you vacuumed/analyzed since reloading your data?  Compare some
EXPLAIN ANALYZE outputs for identical queries on the two machines,
that usually helps figure out what's wrong.

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Scott Marlowe
Date:
On Fri, 2006-04-07 at 12:58, Gavin Hamill wrote:
> Bing-bong, passenger announcement.. the panic train is now pulling into
> platform 8.1.3. Bing-bong. =)
>
> OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> up and our website is next to unusable. The IBM is not swapping (not
> with 16GB of RAM!), disk i/o is low, but there must be something
> critically wrong for this monster to be performing so badly..
>
> There is little IO (maybe 500KB/sec), but the CPUs are often at 100%
> usage.

Can you test your AIX box with linux on it?  It may well be that
something in AIX is causing this performance problem.  I know that on
the same SPARC hardware, a postgresql database is 2 or more times faster
on top of linux or BSD than it is on solaris, at least it was back a few
years ago when I tested it.

Are the same queries getting the same basic execution plan on both
boxes?  Turn on logging for slow queries, and explain analyze them on
both machines to see if they are.

If they aren't, figure out why.

I'd put the old 4 way Xeon back in production and do some serious
testing of this pSeries machine.  IBM should be willing to help you, I
hope.

My guess is that this is an OS issue.  Maybe there are AIX tweaks that
will get it up to the same or higher level of performance as your four
way xeon.  Maybe there aren't.

Myself, I'd throw a spare drive in for the OS, put some flavor of linux
on it

http://www-1.ibm.com/partnerworld/pwhome.nsf/weblook/pat_linux_learn_why_power.html

and do some load testing there.  If the machine can't perform up to
snuff with the same basic OS and a similar setup to your Xeon, send it
back to IBM and buy one of these:

http://www.asaservers.com/system_dept.asp?dept_id=SD-002

or something similar.  I can't imagine it costing more than a pSeries.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 07 Apr 2006 14:41:39 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Gavin Hamill <gdh@laterooms.com> writes:
> > OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> > (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> > up and our website is next to unusable. The IBM is not swapping (not
> > with 16GB of RAM!), disk i/o is low, but there must be something
> > critically wrong for this monster to be performing so badly..
>
> Have you vacuumed/analyzed since reloading your data?

Absolutely - a VACUUM FULL was the first thing I did, and have VACUUM ANALYZE VERBOSE'd a couple of times since. I have
plentyof overhead to keep the entire free space map in RAM. 

> Compare some
> EXPLAIN ANALYZE outputs for identical queries on the two machines,
> that usually helps figure out what's wrong.

If only :)

Since 90% of the db work is the 'hotelsearch' function (which is 350 lines-worth that I'm not permitted to share :(( ),
anEXPLAIN ANALYZE reveals practically nothing: 

##### jalapeno (IBM)
laterooms=# EXPLAIN ANALYZE select * from hotelsearch(12.48333::numeric, 41.90000::numeric, 5::int4,
'2006-04-13'::date,5::int4, NULL::int4, 1::int4, NULL::int4, NULL::int4, TRUE::bool, FALSE::bool, FALSE::bool, 1::int2,
'GBP'::text,'Roma'::text, 7::int4, NULL::int4, NULL::int4) limit 500; 
                                                          QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.25 rows=500 width=1587) (actual time=2922.282..2922.908 rows=255 loops=1)
   ->  Function Scan on hotelsearch  (cost=0.00..12.50 rows=1000 width=1587) (actual time=2922.277..2922.494 rows=255
loops=1)
 Total runtime: 2923.296 ms
(3 rows)

##### cayenne (xeon)
laterooms=# EXPLAIN ANALYZE select * from hotelsearch(12.48333::numeric, 41.90000::numeric, 5::int4,
'2006-04-13'::date,5::int4, NULL::int4, 1::int4, NULL::int4, NULL::int4, TRUE::bool, FALSE::bool, FALSE::bool, 1::int2,
'GBP'::text,'Roma'::text, 7::int4, NULL::int4, NULL::int4) limit 500; 
                                                          QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..6.25 rows=500 width=1587) (actual time=1929.483..1930.103 rows=255 loops=1)
   ->  Function Scan on hotelsearch  (cost=0.00..12.50 rows=1000 width=1587) (actual time=1929.479..1929.693 rows=255
loops=1)
 Total runtime: 1930.506 ms
(3 rows)


The 'LIMIT 500' is a red herring since the function body will get all data, so reducing the LIMIT in the call to
hotelsearchdoesn't reduce the amount of work being done. 

The killer in it all is tail'ing the postgres log (which I have set only to log queries at 1000ms or up) is things will
bereturning at 1000-2000ms.. then suddenly shoot up to 8000ms.. and if I try a few of those 8000ms queries on the xeon
box,they exec in ~1500ms.. and if I try them again a few moments later on the ibm, they'll also exec in maybe ~2500ms. 

This is one hell of a moving target and I can't help but think I'm just missing something that's right in front of my
nose,too close to see.  

Cheers,
Gavin.


Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 07 Apr 2006 13:54:21 -0500
Scott Marlowe <smarlowe@g2switchworks.com> wrote:

> Are the same queries getting the same basic execution plan on both
> boxes?  Turn on logging for slow queries, and explain analyze them on
> both machines to see if they are.

See reply to Tom Lane :)

> I'd put the old 4 way Xeon back in production and do some serious
> testing of this pSeries machine.  IBM should be willing to help you, I
> hope.

They probably would if this had been bought new - as it is, we have
rented the machine for a month from a 2nd-user dealer to see if it's
capable of taking the load. I'm now glad we did this.

> My guess is that this is an OS issue.  Maybe there are AIX tweaks that
> will get it up to the same or higher level of performance as your four
> way xeon.  Maybe there aren't.

The pSeries isn't much older than our Xeon machine, and I expected the
performance level to be exemplary out of the box.. we've enabled the
64-bit kernel+userspace, and compiled pg for 64-bitness with the gcc
flags as reccommended by Senica Cunningham on this very list..

> Myself, I'd throw a spare drive in for the OS, put some flavor of
> linux on it

Terrifying given I know nothing about the pSeries boot system, but at
this stage I'm game for nearly anything.

> http://www.asaservers.com/system_dept.asp?dept_id=SD-002

Multi-Opteron was the other thing we considered but decided to give
'Big Iron' UNIX a whirl...

Cheers,
Gavin.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
Gavin Hamill <gdh@laterooms.com> writes:
> Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>> My guess is that this is an OS issue.  Maybe there are AIX tweaks that
>> will get it up to the same or higher level of performance as your four
>> way xeon.  Maybe there aren't.

> The pSeries isn't much older than our Xeon machine, and I expected the
> performance level to be exemplary out of the box..

I'm fairly surprised too.  One thing I note from your comparison of
settings is that the default WAL sync method is different on the two
operating systems.  If the query load is update-heavy then it would be
very worth your while to experiment with the sync method.  However,
if the bottleneck is pure-SELECT transactions then WAL sync should not
be a factor at all.

Does AIX have anything comparable to oprofile or dtrace?  It'd be
interesting to try to monitor things at that level and see what we can
learn.  Failing a low-level profiler, there should at least be something
comparable to strace --- you should try watching some of the backends
with strace and see what their behavior is when the performance goes
south.  Lots of delaying select()s or semop()s would be a red flag.

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
"D'Arcy J.M. Cain"
Date:
On Fri, 7 Apr 2006 20:59:19 +0100
Gavin Hamill <gdh@laterooms.com> wrote:
> > I'd put the old 4 way Xeon back in production and do some serious
> > testing of this pSeries machine.  IBM should be willing to help you, I
> > hope.
>
> They probably would if this had been bought new - as it is, we have
> rented the machine for a month from a 2nd-user dealer to see if it's
> capable of taking the load. I'm now glad we did this.

We also had problems with a high end AIX system and we got no help from
IBM.  They expected you to put Oracle on and if you used anything else
you were on your own.  We had exactly the same issue.  We expected to
get an order of magnitude improvement and instead the app bogged down.
It also got worse over time.  We had to reboot every night to get
anything out of it.  Needless to say, they got their system back.

>
> > My guess is that this is an OS issue.  Maybe there are AIX tweaks that
> > will get it up to the same or higher level of performance as your four
> > way xeon.  Maybe there aren't.
>
> The pSeries isn't much older than our Xeon machine, and I expected the
> performance level to be exemplary out of the box.. we've enabled the
> 64-bit kernel+userspace, and compiled pg for 64-bitness with the gcc
> flags as reccommended by Senica Cunningham on this very list..

That's Seneca.

We found that our money was better spent on multiple servers running
NetBSD with a home grown multi-master replication system.  Need more
power?  Just add more servers.

--
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Scott Marlowe
Date:
On Fri, 2006-04-07 at 14:59, Gavin Hamill wrote:
> On Fri, 07 Apr 2006 13:54:21 -0500
> Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>
> > Are the same queries getting the same basic execution plan on both
> > boxes?  Turn on logging for slow queries, and explain analyze them on
> > both machines to see if they are.
>
> See reply to Tom Lane :)

I didn't see one go by yet...  Could be sitting in the queue.

> They probably would if this had been bought new - as it is, we have
> rented the machine for a month from a 2nd-user dealer to see if it's
> capable of taking the load. I'm now glad we did this.

Thank god.  I had a picture of you sitting on top of a brand new very
expensive pSeries

Let us know if changing the fsync setting helps.  Hopefully that's all
the problem is.

Off on a tangent.  If the aggregate memory bandwidth of the pSeries is
no greater than you Xeon you might not see a big improvement if you were
memory bound before.  If you were CPU bound, you may or may not see an
improvement.

Can you describe the disc subsystems in the two machines for us?  What
kind of read / write load you have?  It could be the older box was
running on IDE drives with fake fsync responses which would lie, be
fast, but not reliable in case of a power outage.

Do you have hardware RAID for your pSeries?  how many discs, how much
battery backed cache, etc?

> Multi-Opteron was the other thing we considered but decided to give
> 'Big Iron' UNIX a whirl...

It still might be a good choice, if it's a simple misconfiguration
issue.

But man, those new multiple core opterons can make some impressive
machines for very little money.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Greg Stark
Date:
Gavin Hamill <gdh@laterooms.com> writes:

> This is one hell of a moving target and I can't help but think I'm just
> missing something that's right in front of my nose, too close to see.

I'm assuming you compiled postgres yourself? Do you have the output from the
configure script? I'm wondering if it failed to find a good spinlock match for
the architecture. Not sure if that's really likely but it's a possibility.

Also, I'm pretty sure IBM has tools that would let you disable some of the
processors to see if maybe it's a shared memory bus issue. If I understand you
right the machine isn't in production yet? In which case I would get timing
information for a single processor, two processors, four processors, and eight
processors. If you see it max out and start dropping then that would point
towards a hardware/low level postgres issue like spinlocks or shared memory
rather than a high level database issue like stats.

--
greg

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gábriel Ákos
Date:
Gavin Hamill wrote:
> Bing-bong, passenger announcement.. the panic train is now pulling into
> platform 8.1.3. Bing-bong. =)
>
> OK, having moved from our quad-xeon to an 8-CPU IBM pSeries 650
> (8x1.45GHz POWER4 instead of 4 x 3GHz Xeon), our query times have shot
> up and our website is next to unusable. The IBM is not swapping (not

I would say running _one_ query at a time depends on the power of _one_
cpu. PPCs aren't that fast, I'd say they are slower than Xeons. Moreover
I'm sure that AMD Opterons are faster than Xeons. I'd say you should go
and test an opteron-based configuration. You'll get much more power for
the same (much likely for less) money.


--
Üdvözlettel,
Gábriel Ákos
-=E-Mail :akos.gabriel@i-logic.hu|Web:  http://www.i-logic.hu=-
-=Tel/fax:+3612367353            |Mobil:+36209278894            =-

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 7 Apr 2006 16:16:02 -0400
"D'Arcy J.M. Cain" <darcy@druid.net> wrote:

> We also had problems with a high end AIX system and we got no help
> from IBM.  They expected you to put Oracle on and if you used
> anything else you were on your own.

Urk, I thought IBM were supposedly Linux sycophants thesedays...

> We had exactly the same issue.
> We expected to get an order of magnitude improvement and instead the
> app bogged down.

That's kind of encouraging, I suppose - that it might not be something
mind-bogglingly stupid I'm doing.

> It also got worse over time.  We had to reboot every
> night to get anything out of it.  Needless to say, they got their
> system back.

<nod>

> That's Seneca.

Oops - meant to check the spelling before I sent that =)

> We found that our money was better spent on multiple servers running
> NetBSD with a home grown multi-master replication system.  Need more
> power?  Just add more servers.

Aye, I originally suggested multiple servers, but was talked round to
one giant db so that our devels didn't have to rewrite code to deal
with read/write + read-only db handles...

Cheers,
Gavin.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 07 Apr 2006 15:24:18 -0500
Scott Marlowe <smarlowe@g2switchworks.com> wrote:

> > See reply to Tom Lane :)
>
> I didn't see one go by yet...  Could be sitting in the queue.

If it's not arrived by now - EXPLAIN ANALYZE doesn't tell me
anything :)

> Let us know if changing the fsync setting helps.  Hopefully that's all
> the problem is.

fsync's already off - yes a bit scary, but our I/O is only about
500KB/sec writing.. the whole db fits in RAM / kernel disk cache, and
I'd rather have performance than security at this exact moment..

> Off on a tangent.  If the aggregate memory bandwidth of the pSeries is
> no greater than you Xeon you might not see a big improvement if you
> were memory bound before.  If you were CPU bound, you may or may not
> see an improvement.

I did look into the specs of the system, and the memory bw on the
pSeries was /much/ greater than the Xeon - it's one of the things that
really pushed me towards it in the end. I forget the figures, but it
was 3 or 4 times greater.

> Can you describe the disc subsystems in the two machines for us?  What
> kind of read / write load you have?  It could be the older box was
> running on IDE drives with fake fsync responses which would lie, be
> fast, but not reliable in case of a power outage.

Again, I'm confident that I/O's not the killer here.. the Xeon is a Dell
6850- hardware RAID1.. SCSI drives.

> > Multi-Opteron was the other thing we considered but decided to give
> > 'Big Iron' UNIX a whirl...
>
> It still might be a good choice, if it's a simple misconfiguration
> issue.
>
> But man, those new multiple core opterons can make some impressive
> machines for very little money.

So I see - we could buy two quad-opterons for the cost of renting this
pSeries for a month....

Cheers,
Gavin.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 07 Apr 2006 16:06:02 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > The pSeries isn't much older than our Xeon machine, and I expected
> > the performance level to be exemplary out of the box..
>
> I'm fairly surprised too.  One thing I note from your comparison of
> settings is that the default WAL sync method is different on the two
> operating systems.

We're very read-focussed.. there's update activity, sure, but the IO is
only pushing about 500KByte/sec on average, usually much less. I also
have fsync switched off - yes dangerous, but I just want to eliminate
IO completely as a contributing factor.

> Does AIX have anything comparable to oprofile or dtrace?

I've used neither on Linux, but a quick google showed up a few articles
along the lines of 'in theory it shouldn't be hard to port to AIX....'
but nothing concrete. My guess is IBM sell a tool to do this. Hell, the
C++ compiler is £1200... (hence our use of GCC 4.1 to compile pg)


> Failing a low-level profiler, there should at least be
> something comparable to strace --- you should try watching some of
> the backends with strace and see what their behavior is when the
> performance goes south.  Lots of delaying select()s or semop()s would
> be a red flag.

There's truss installed which seems to do the same as strace on
Linux... and here's a wildly non-scientific  glance..  I watched the
'topas' output (top for AIX) , identified a PID that was doing a lot of
work, then attached truss to that pid. In addition to lots of send
(), recv() and lseek()s... about once a minute I saw hundreds of calls
to __semop() interspersed with _select(), followed by tons of lseek()
+kread()+__semop() and then I can see the kwrite() to the pg logfile

246170: kwrite(2, " L O G :     d u", 8)                = 8 etc.

Cheers,
Gavin.


Re: pg 8.1.3, AIX, huge box, painfully slow.

From
"Luke Lonergan"
Date:
Gavin,

On 4/7/06 2:24 PM, "Gavin Hamill" <gdh@laterooms.com> wrote:

> I did look into the specs of the system, and the memory bw on the
> pSeries was /much/ greater than the Xeon - it's one of the things that
> really pushed me towards it in the end. I forget the figures, but it
> was 3 or 4 times greater.

From the literature at:
  http://www-03.ibm.com/servers/eserver/pseries/hardware/midrange/p650_desc.html

“The pSeries 650 features a peak aggregate memory to L3 cache bandwidth of 25.6GB/second for an 8way configuration. In addition, aggregate I/O bandwidth is up to 16GB/second. The result is a remarkable combination of system architecture, speed and power that delivers efficient and cost-effective data sharing and application throughput.”

That’s a total of 25.6GB/s for 8 CPUs, or 3.2GB/s per CPU.  3GHz P4 Xeons typically have an 800MHz memory bus with double the speed at 6.4GB/s result (800MHz x 8 bytes per L2 cache line = 6.4GB/s).  Furthermore, the speed at which the P4 Xeon can load data into L2 cache from memory is matched to the bus because the L2 cache line width is 8 bytes wide and can stream data to L2 at full bus speed.

That said, I find typical memory bandwidth for the P4 in applications is limited at about 2GB/s.  See here for more detail: http://www.cs.virginia.edu/stream/standard/Bandwidth.html

In fact, looking at the results there, the IBM 650m2 only gets 6GB/s on all 8 CPUs.  I wouldn’t be surprised if the strange L3 cache architecture of the IBM 650 is holding it back from streaming memory access efficiently.

Whether this has anything to do with your problem or not, I have no idea!

- Luke

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
Gavin Hamill <gdh@laterooms.com> writes:
> There's truss installed which seems to do the same as strace on
> Linux... and here's a wildly non-scientific  glance..  I watched the
> 'topas' output (top for AIX) , identified a PID that was doing a lot of
> work, then attached truss to that pid. In addition to lots of send
> (), recv() and lseek()s...

Those are good, they represent real work getting done.

> about once a minute I saw hundreds of calls
> to __semop() interspersed with _select(),

This is not good.  Did the semop storms coincide with visible slowdown?
(I'd assume so, but you didn't actually say...)

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> That said, I find typical memory bandwidth for the P4 in applications is
> limited at about 2GB/s.  See here for more detail:
> http://www.cs.virginia.edu/stream/standard/Bandwidth.html

> In fact, looking at the results there, the IBM 650m2 only gets 6GB/s
> on all 8 CPUs.  I wouldn't be surprised if the strange L3 cache
> architecture of the IBM 650 is holding it back from streaming memory
> access efficiently.

Given Gavin's latest report, I'm wondering how much the IBM slows down
when a spinlock operation is involved.  If the memory architecture isn't
good about supporting serialized access to memory, that gaudy sounding
bandwidth number might have little to do with PG's real-world behavior.
On the other hand, we already know that Xeons suck about as badly as
can be on that same measure; could the pSeries really be worse?

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
"Luke Lonergan"
Date:
Tom,

On 4/7/06 3:02 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> On the other hand, we already know that Xeons suck about as badly as
> can be on that same measure; could the pSeries really be worse?

I wouldn't be too surprised, but it sounds like it needs a test.  Do we have
a test for this?  Is there a contention-prone query stream that we can think
up?

- Luke



Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> On 4/7/06 3:02 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> On the other hand, we already know that Xeons suck about as badly as
>> can be on that same measure; could the pSeries really be worse?

> I wouldn't be too surprised, but it sounds like it needs a test.  Do we have
> a test for this?  Is there a contention-prone query stream that we can think
> up?

If you want you could install a pre-8.1 PG and then try one of the
queries that we were using as test cases a year ago for spinlock
investigations.  I don't recall details right now but I remember
having posted a pretty trivial test case that would send a
multiprocessor machine into context-swap storm, which sounds a whole
lot like what Gavin is seeing.

I think that 8.1 ought to be relatively free of buffer-manager spinlock
contention, which is why I doubt that test case would be interesting
against 8.1.  The interesting question is what else is he seeing
contention for, if it's not the BufMgrLock?

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 07 Apr 2006 17:56:49 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> This is not good.  Did the semop storms coincide with visible
> slowdown? (I'd assume so, but you didn't actually say...)

If I'd been able to tell, then I'd tell you =) I'll have another go...

Yes, there's a definate correlation here.. I attached truss to the
main postmaster..

$ truss -Ff -p 340344 2>&1 | grep semop

here's a snippet

278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1)        = 0
155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1)        = 0
278774: __semop(15728649, 0x0FFFFFFFFFFF6F10, 1)
114914: __semop(15728649, 0x0FFFFFFFFFFF6A40, 1)        = 0     = 0
114914: __semop(15728650, 0x0FFFFFFFFFFF61E0, 1)
155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1)        = 0     = 0
155712: __semop(15728650, 0x0FFFFFFFFFFF6890, 1)        = 0 1
55712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1)
278774: __semop(15728650, 0x0FFFFFFFFFFF6F10, 1)
155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1)        = 0     = 0
278774: __semop(15728649, 0x0FFFFFFFFFFF7E40, 1)
114914: __semop(15728649, 0x0FFFFFFFFFFF6A80, 1)        = 0     = 0
278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1)

And when I saw a flood of semop's for any particular PID, a second later
in the 'topas' process list would show that PID at a 100% CPU ...

Most intriguing :)

Cheers,
Gavin.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
Gavin Hamill <gdh@laterooms.com> writes:
> On Fri, 07 Apr 2006 17:56:49 -0400
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is not good.  Did the semop storms coincide with visible
>> slowdown? (I'd assume so, but you didn't actually say...)

> Yes, there's a definate correlation here.. I attached truss to the
> main postmaster..
> ...
> And when I saw a flood of semop's for any particular PID, a second later
> in the 'topas' process list would show that PID at a 100% CPU ...

So apparently we've still got a problem with multiprocess contention for
an LWLock somewhere.  It's not the BufMgrLock because that's gone in 8.1.
It could be one of the finer-grain locks that are still there, or it
could be someplace else.

Are you in a position to try your workload using PG CVS tip?  There's a
nontrivial possibility that we've already fixed this --- a couple months
ago I did some work to reduce contention in the lock manager:

2005-12-11 16:02  tgl

    * src/: backend/access/transam/twophase.c,
    backend/storage/ipc/procarray.c, backend/storage/lmgr/README,
    backend/storage/lmgr/deadlock.c, backend/storage/lmgr/lock.c,
    backend/storage/lmgr/lwlock.c, backend/storage/lmgr/proc.c,
    include/storage/lock.h, include/storage/lwlock.h,
    include/storage/proc.h: Divide the lock manager's shared state into
    'partitions', so as to reduce contention for the former single
    LockMgrLock.  Per my recent proposal.  I set it up for 16
    partitions, but on a pgbench test this gives only a marginal
    further improvement over 4 partitions --- we need to test more
    scenarios to choose the number of partitions.

This is unfortunately not going to help you as far as getting that
machine into production now (unless you're brave enough to run CVS tip
as production, which I certainly am not).  I'm afraid you're most likely
going to have to ship that pSeries back at the end of the month, but
while you've got it it'd be awfully nice if we could use it as a testbed
...

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
"Luke Lonergan"
Date:
Gavin,

On 4/7/06 3:27 PM, "Gavin Hamill" <gdh@laterooms.com> wrote:

> 278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1)        = 0
> 155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1)        = 0
> 278774: __semop(15728649, 0x0FFFFFFFFFFF6F10, 1)
> 114914: __semop(15728649, 0x0FFFFFFFFFFF6A40, 1)        = 0     = 0
> 114914: __semop(15728650, 0x0FFFFFFFFFFF61E0, 1)
> 155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1)        = 0     = 0
> 155712: __semop(15728650, 0x0FFFFFFFFFFF6890, 1)        = 0 1
> 55712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1)
> 278774: __semop(15728650, 0x0FFFFFFFFFFF6F10, 1)
> 155712: __semop(15728650, 0x0FFFFFFFFFFF6850, 1)        = 0     = 0
> 278774: __semop(15728649, 0x0FFFFFFFFFFF7E40, 1)
> 114914: __semop(15728649, 0x0FFFFFFFFFFF6A80, 1)        = 0     = 0
> 278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1)

Seems like you're hitting a very small target in RAM with these semop calls.
I wonder what part of the code is doing this - Tom would know better how to
trace it, but the equivalent of oprofile output would be nice.

The other thing that I'd like to see is an evaluation of the memory access
latency of this machine from Register to RAM.  I couldn't find a
benchmarking tool that was UNIX friendly out there, maybe I'll write one
real quick.  I suspect this machine has a heinous latency and a storm of
semops to the same spot of RAM might be a far worse performance problem on
this machine than on others...

- Luke





Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> On 4/7/06 3:27 PM, "Gavin Hamill" <gdh@laterooms.com> wrote:

>> 278774: __semop(15728650, 0x0FFFFFFFFFFF7E80, 1)        = 0
>> 155712: __semop(15728650, 0x0FFFFFFFFFFF5920, 1)        = 0
>> 278774: __semop(15728649, 0x0FFFFFFFFFFF6F10, 1)

> Seems like you're hitting a very small target in RAM with these semop calls.

IIRC the address passed to semop() in our code is always a local struct
on the stack, so that's a bit of a red herring --- there won't be
cross-processor contention for that.

It's plausible though that we are seeing contention across members of
the LWLock array, with the semop storm just being a higher-level symptom
of the real hardware-level problem.  You might try increasing
LWLOCK_PADDED_SIZE to 64 or even 128, see
src/backend/storage/lmgr/lwlock.c (this is something that does exist in
8.1, so it'd be easy to try).

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 07 Apr 2006 18:52:20 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Are you in a position to try your workload using PG CVS tip?  There's
> a nontrivial possibility that we've already fixed this --- a couple
> months ago I did some work to reduce contention in the lock manager:

Well, there's a question. At the moment it's still live - but I'll need
to swap back to the Xeon machine since I can't afford to have a Saturday
with the db firing on three cylinders (out of eight :)

At that point you're welcome to twiddle, compile, throw anything you
want at it. If it helps us as much as the greater pg world, then that's
perfect.

> This is unfortunately not going to help you as far as getting that
> machine into production now (unless you're brave enough to run CVS tip
> as production, which I certainly am not).

.. if the problem can actually be boiled down to the locking/threading
issues, surely it should be straightforward to backport those changes
to 8.1.3 mainline?

> I'm afraid you're most
> likely going to have to ship that pSeries back at the end of the
> month, but while you've got it it'd be awfully nice if we could use
> it as a testbed ...

We have it for the next 2 weeks, and whilst I can't guarantee access for
all that time, you're welcome to hammer away at it over this weekend if
that's any help? Mail me privately and I'll sort out login details if
this is interesting.

Cheers,
Gavin.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
On Fri, 07 Apr 2006 15:56:52 -0700
"Luke Lonergan" <llonergan@greenplum.com> wrote:

> Seems like you're hitting a very small target in RAM with these semop
> calls. I wonder what part of the code is doing this - Tom would know
> better how to trace it, but the equivalent of oprofile output would
> be nice.

I'm happy to test whatever I can, but I simply don't know enough AIX to
be able to tell whether a similar kernel-level profiler is
available/possible.

> The other thing that I'd like to see is an evaluation of the memory
> access latency of this machine from Register to RAM.  I couldn't find
> a benchmarking tool that was UNIX friendly out there, maybe I'll
> write one real quick.  I suspect this machine has a heinous latency
> and a storm of semops to the same spot of RAM might be a far worse
> performance problem on this machine than on others...

Well, as I said to Tom, the machine is available for running tests
on :) If it helps us, and helps pg become more AIX friendly, then I'm
all for whatever needs done...

Cheers,
Gavin.

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Richard Huxton
Date:
Gavin Hamill wrote:
> On Fri, 07 Apr 2006 15:24:18 -0500
> Scott Marlowe <smarlowe@g2switchworks.com> wrote:
>
>>> See reply to Tom Lane :)
>> I didn't see one go by yet...  Could be sitting in the queue.
>
> If it's not arrived by now - EXPLAIN ANALYZE doesn't tell me
> anything :)
>
>> Let us know if changing the fsync setting helps.  Hopefully that's all
>> the problem is.
>
> fsync's already off - yes a bit scary, but our I/O is only about
> 500KB/sec writing.. the whole db fits in RAM / kernel disk cache, and
> I'd rather have performance than security at this exact moment..

>>> Multi-Opteron was the other thing we considered but decided to give
>>> 'Big Iron' UNIX a whirl...
>> It still might be a good choice, if it's a simple misconfiguration
>> issue.
>>
>> But man, those new multiple core opterons can make some impressive
>> machines for very little money.
>
> So I see - we could buy two quad-opterons for the cost of renting this
> pSeries for a month....

I don't know about the pSeries, but I had a client the other week with
the same usage pattern. They made the switch from a quad Xeon to dual
(+dual-core) Opteron and were extremely impressed.

I can probably put you in touch with their sysadmin - contact me
off-list if you'd like that.

--
   Richard Huxton
   Archonet Ltd

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Simon Riggs
Date:
On Fri, 2006-04-07 at 19:05 -0400, Tom Lane wrote:

> It's plausible though that we are seeing contention across members of
> the LWLock array, with the semop storm just being a higher-level symptom
> of the real hardware-level problem.  You might try increasing
> LWLOCK_PADDED_SIZE to 64 or even 128, see
> src/backend/storage/lmgr/lwlock.c (this is something that does exist in
> 8.1, so it'd be easy to try).

pSeries cache lines are 128 bytes wide, so I'd go straight to 128.

If you're renting all 8 CPUs, I'd drop to 4 and try that instead. With 8
CPUs the contention will vary according to what each CPU is doing at any
one time - when they all hit the contention spot, things will get worse.

The pSeries has good CPUs and great caching, so I'd expect contention to
be somewhat more apparent as a bottleneck.

Best Regards, Simon Riggs


Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
Simon Riggs wrote:

> pSeries cache lines are 128 bytes wide, so I'd go straight to 128.
>
Hello :)

OK, that line of code is:

#define LWLOCK_PADDED_SIZE      (sizeof(LWLock) <= 16 ? 16 : 32)

What should I change this to? I don't understand the syntax of the <= 16
? : stuff...

would a simple  "#define LWLOCK_PADDED_SIZE  128" be sufficient?

>If you're renting all 8 CPUs, I'd drop to 4 and try that instead. With 8
>CPUs the contention will vary according to what each CPU is doing at any
>one time - when they all hit the contention spot, things will get worse.
>
>
>
We have a physical machine installed in our rack at the data centre,
rather than renting a virtual partition of a real machine... I'm not
sure how to enable/disable CPUs even with the help of 'smitty' :)

>The pSeries has good CPUs and great caching, so I'd expect contention to
>be somewhat more apparent as a bottleneck.
>
>
>
Yep, I expected 32MB of 'L3' cache would yield impressive results :)

Cheers,
Gavin.



Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Tom Lane
Date:
Gavin Hamill <gdh@laterooms.com> writes:
> would a simple  "#define LWLOCK_PADDED_SIZE  128" be sufficient?

Yeah, that's fine.

            regards, tom lane

Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Brad Nicholson
Date:
Tom Lane wrote:

> This is unfortunately not going to help you as far as getting that
> machine into production now (unless you're brave enough to run CVS tip
> as production, which I certainly am not).  I'm afraid you're most likely
> going to have to ship that pSeries back at the end of the month, but
> while you've got it it'd be awfully nice if we could use it as a testbed

We have PSeries boxes here that won't be going away anytime soon.  If
there are any specific test cases that need to run, I should be able to
find the time to do it.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.


Re: pg 8.1.3, AIX, huge box, painfully slow.

From
Gavin Hamill
Date:
Tom Lane wrote:

>Gavin Hamill <gdh@laterooms.com> writes:
>
>
>>would a simple  "#define LWLOCK_PADDED_SIZE  128" be sufficient?
>>
>>
>
>Yeah, that's fine.
>
>

OK I tried that but noticed no real improvement... in the interim I've
installed Debian on the pSeries (using
http://debian.gonicus.de/debian/dists/sarge/main/disks-powerpc/current/pseries/install.txt
) and using a simple load-test script - it picks a 'hotelsearch' select
at random from a big file and just does a pg_query on that via PHP...

Using apachebench with 10 clients gave a loadavg of about 10 after a few
minutes, and the logs showed typical query times of 8 seconds. Again, no
disk activity, normal context-switching, just full-out CPU usage...

We're improving the quality + efficiency of the hotelsearch function all
the time (Simon will certainly be able to vouch for its complexity) - am
really uncertain what to do next tho! :/

Cheers,
Gavin.