Thread: Xeon twice the performance of opteron

Xeon twice the performance of opteron

From
Jeff
Date:
hey folks,

Running into some odd performance issues between a few of our db
boxes.  While trying to speed up a query I ran it on another box and
it was twice as fast.  The plans are identical and various portions of
the query run in the same amount of time - it all boils down to most
of the time being spent in a join filter.  The plan is as good as it
is going to get but the thing that is concerning me, which hopefully
some folks here may have some insight on, is the very large difference
in runtime.

three boxes:
    A: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz  (Runs query
fastest)
        4MB cache
    B: Quad-Core AMD Opteron(tm) Processor 2352 (2.1GHZ) (Main production
box, currently, middle speed)
        512k cache
    C: Quad-Core AMD Opteron(tm) Processor 2378 (2.4GHZ)
        512k cache

A & B are running PG 8.4.2 (yes, I know it desperately need to be
upgraded). C was also on 8.4.2 and since it was not in production I
upgraded it to 8.4.7 and got the same performance as 8.4.2.  Dataset
on A & B is the same C is mostly the same, but is missing a couple
weeks of data (but since this query runs over 3 years of data, it is
negligable - plus C runs the slowest!)

All three running FC10 with kernel Linux db06
2.6.27.19-170.2.35.fc10.x86_64 #1 SMP Mon Feb 23 13:00:23 EST 2009
x86_64 x86_64 x86_64 GNU/Linux

Load is very low on each box. The query is running from shared_buffers
- no real IO is occuring.

The average timing for the query in question is 90ms on A,  180ms on B
and 190ms on C.

Now here's where some odd stuff starts piling up: explain analyze
overhead on said queries:
20ms on A, 50ms on B and 85ms on C(!!)

We had one thought about potential NUMA issues, but doing a series
(100) of connect, query, disconnect and looking at the timings reveals
them all to be solid... but even still we wouldn't expect it to be
that awful.  The smaller cache of the opterons is also a valid argument.

I know we're running an old kernel, I'm tempted to upgrade to see what
will happen, but at the same time I'm afraid it'll upgrade to a kernel
with a broken [insert major subsystem here] which has happened before.

Anybody have some insight into this or run into this before?

btw, little more background on the query:

    ->  Nested Loop  (cost=5.87..2763.69 rows=9943 width=0) (actual
time=0.571..2
74.750 rows=766 loops=1)
          Join Filter: (ce.eventdate >= (md.date - '6 days'::interval))
          ->  Nested Loop  (cost=5.87..1717.98 rows=27 width=8)
(actual time=0.53
3..8.301 rows=159 loops=1)
        [stuff removed here]
         ->  Index Scan using  xxxxxxx_date_idx on xxxxxx md
(cost=0.00..19.50 rows=1099 width=8) (actual time=0.023..0.729
rows=951 loops=15
9)
                Index Cond: (ce.eventdate <= md.date)


On all three boxes that inner nestloop completes in about the same
amount of time - it is that join filter that is causing the pain and
agony. (If you are noticing the timing differences, that is because
the numbers above are the actual numbers, not explain analyze).  The
query is pulling up a rolling window of events that occured on a
specific date. This query pulls up al the data for a period of time.
ce.eventdate is indexed, and is used in the outer nestloop.  Thinking
more about what is going on cache thrashing is certainly a possibility.

the amazing explain analyze overhead is also very curious - we all
know it adds overhead, but 85ms?  Yow.

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: Xeon twice the performance of opteron

From
J Sisson
Date:
On Thu, Mar 17, 2011 at 10:13 AM, Jeff <threshar@torgo.978.org> wrote:
> hey folks,
>
> Running into some odd performance issues between a few of our db boxes.

We've noticed similar results both in OLTP and data warehousing conditions here.

Opteron machines just seem to lag behind *especially* in data
warehousing.  Smaller
cache for sorting/etc... is what I'd always chalked it up to, but I'm
open to other theories
if they exist.

Re: Xeon twice the performance of opteron

From
Claudio Freire
Date:
On Thu, Mar 17, 2011 at 1:42 PM, J Sisson <sisson.j@gmail.com> wrote:
> On Thu, Mar 17, 2011 at 10:13 AM, Jeff <threshar@torgo.978.org> wrote:
>> hey folks,
>>
>> Running into some odd performance issues between a few of our db boxes.
>
> We've noticed similar results both in OLTP and data warehousing conditions here.
>
> Opteron machines just seem to lag behind *especially* in data
> warehousing.  Smaller
> cache for sorting/etc... is what I'd always chalked it up to, but I'm
> open to other theories
> if they exist.

It's my theory as well - you know, this could be solved by JITting
complex expressions.

Bad cache behavior in application often comes as a side-effect of
interpreted execution (in this case, of expressions, conditions,
functions). A JIT usually solves this cache inefficiency.

I know, adding any kind of JIT to pg could be a major task.

Re: Xeon twice the performance of opteron

From
Craig James
Date:
On 3/17/11 9:42 AM, J Sisson wrote:
> On Thu, Mar 17, 2011 at 10:13 AM, Jeff<threshar@torgo.978.org>  wrote:
>> hey folks,
>>
>> Running into some odd performance issues between a few of our db boxes.
> We've noticed similar results both in OLTP and data warehousing conditions here.
>
> Opteron machines just seem to lag behind *especially* in data
> warehousing.  Smaller
> cache for sorting/etc... is what I'd always chalked it up to, but I'm
> open to other theories
> if they exist.
We had a similar result with a different CPU-intensive open-source package, and discovered that if we compiled it on
theOpteron it ran almost twice as fast as binaries compiled on Intel hardware. We thought we could compile once, run
everywhere,but it's not true.  It must have been some specific optimization difference between Intel and AMD that the
gcccompiler knows about.  I don't know if that's the case here, but it's a thought. 

Craig

Re: Xeon twice the performance of opteron

From
"mark"
Date:

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Jeff
> Sent: Thursday, March 17, 2011 9:14 AM
> To: pgsql-performance@postgresql.org
> Cc: Brian Ristuccia
> Subject: [PERFORM] Xeon twice the performance of opteron
>
> hey folks,
>
> Running into some odd performance issues between a few of our db
> boxes.  While trying to speed up a query I ran it on another box and
> it was twice as fast.  The plans are identical and various portions of
> the query run in the same amount of time - it all boils down to most
> of the time being spent in a join filter.  The plan is as good as it
> is going to get but the thing that is concerning me, which hopefully
> some folks here may have some insight on, is the very large difference
> in runtime.
>
> three boxes:
>     A: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz  (Runs query
> fastest)
>         4MB cache
>     B: Quad-Core AMD Opteron(tm) Processor 2352 (2.1GHZ) (Main
> production
> box, currently, middle speed)
>         512k cache
>     C: Quad-Core AMD Opteron(tm) Processor 2378 (2.4GHZ)
>         512k cache
>
> A & B are running PG 8.4.2 (yes, I know it desperately need to be
> upgraded). C was also on 8.4.2 and since it was not in production I
> upgraded it to 8.4.7 and got the same performance as 8.4.2.  Dataset
> on A & B is the same C is mostly the same, but is missing a couple
> weeks of data (but since this query runs over 3 years of data, it is
> negligable - plus C runs the slowest!)
>
> All three running FC10 with kernel Linux db06
> 2.6.27.19-170.2.35.fc10.x86_64 #1 SMP Mon Feb 23 13:00:23 EST 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> Load is very low on each box. The query is running from shared_buffers
> - no real IO is occuring.
>
> The average timing for the query in question is 90ms on A,  180ms on B
> and 190ms on C.
>
> Now here's where some odd stuff starts piling up: explain analyze
> overhead on said queries:
> 20ms on A, 50ms on B and 85ms on C(!!)
>
> We had one thought about potential NUMA issues, but doing a series
> (100) of connect, query, disconnect and looking at the timings reveals
> them all to be solid... but even still we wouldn't expect it to be
> that awful.  The smaller cache of the opterons is also a valid
> argument.
>
> I know we're running an old kernel, I'm tempted to upgrade to see what
> will happen, but at the same time I'm afraid it'll upgrade to a kernel
> with a broken [insert major subsystem here] which has happened before.
>
> Anybody have some insight into this or run into this before?
>
> btw, little more background on the query:
>
>     ->  Nested Loop  (cost=5.87..2763.69 rows=9943 width=0) (actual
> time=0.571..2
> 74.750 rows=766 loops=1)
>           Join Filter: (ce.eventdate >= (md.date - '6 days'::interval))
>           ->  Nested Loop  (cost=5.87..1717.98 rows=27 width=8)
> (actual time=0.53
> 3..8.301 rows=159 loops=1)
>         [stuff removed here]
>          ->  Index Scan using  xxxxxxx_date_idx on xxxxxx md
> (cost=0.00..19.50 rows=1099 width=8) (actual time=0.023..0.729
> rows=951 loops=15
> 9)
>                 Index Cond: (ce.eventdate <= md.date)
>
>
> On all three boxes that inner nestloop completes in about the same
> amount of time - it is that join filter that is causing the pain and
> agony. (If you are noticing the timing differences, that is because
> the numbers above are the actual numbers, not explain analyze).  The
> query is pulling up a rolling window of events that occured on a
> specific date. This query pulls up al the data for a period of time.
> ce.eventdate is indexed, and is used in the outer nestloop.  Thinking
> more about what is going on cache thrashing is certainly a possibility.
>
> the amazing explain analyze overhead is also very curious - we all
> know it adds overhead, but 85ms?  Yow.
>
> --
> Jeff Trout <jeff@jefftrout.com>
> http://www.stuarthamm.net/
> http://www.dellsmartexitin.com/

I am sure you might have already checked for this, but just incase...
Did you verify that no power savings stuff is turned on in the BIOS or at
the kernel ?

I have to set ours to something HP calls static high performance or
something like that if I want boxes that are normally pretty idle to execute
in a predictable fashion for sub second queries.

I assume you checked with a steam benchmark results on the AMD machines to
make sure they are getting in the ballpark of where they are supposed to ?







Re: Xeon twice the performance of opteron

From
Scott Marlowe
Date:
On Thu, Mar 17, 2011 at 9:13 AM, Jeff <threshar@torgo.978.org> wrote:
> hey folks,
>
> Running into some odd performance issues between a few of our db boxes.
>  While trying to speed up a query I ran it on another box and it was twice
> as fast.  The plans are identical and various portions of the query run in
> the same amount of time - it all boils down to most of the time being spent
> in a join filter.  The plan is as good as it is going to get but the thing
> that is concerning me, which hopefully some folks here may have some insight
> on, is the very large difference in runtime.

My experience puts the 23xx series opterons in a same general
neighborhood as the E5300 and a little behind the E5400 series Xeons.
OTOH, the newer Magny Cours Opterons stomp both of those into the
ground.

Do any of those machines have zone.reclaim.mode = 1 ???

i.e.:

sysctl -a|grep zone.reclaim
vm.zone_reclaim_mode = 0

I had a machine that had just high enough interzone communications
cost to get it turned on by default and it slowed it right to a crawl
under pgsql.

Re: Xeon twice the performance of opteron

From
Jeff
Date:
On Mar 17, 2011, at 9:39 PM, Scott Marlowe wrote:

>
> My experience puts the 23xx series opterons in a same general
> neighborhood as the E5300 and a little behind the E5400 series Xeons.
> OTOH, the newer Magny Cours Opterons stomp both of those into the
> ground.
>
> Do any of those machines have zone.reclaim.mode = 1 ???
>
> i.e.:
>
> sysctl -a|grep zone.reclaim
> vm.zone_reclaim_mode = 0
>
> I had a machine that had just high enough interzone communications
> cost to get it turned on by default and it slowed it right to a crawl
> under pgsql.


It is set to zero on this machine.

I've tried PG compiled on the box itself, same result.

As for power savings, according to cpuinfo all the cores are running
at 2.1ghz

We had another machine which typically runs as a web server running on
an AMD Opteron(tm) Processor 6128
which after diddling the speed governor to performance (thus bumping
cpu speed to 2ghz from 800mhz) query speed increased to 100ms, still
not as fast as the xeon, but close enough.

I think I'm just hitting some wall of the architecture. I tried
getting some oprofile love from it but oprofile seems to not work on
that box. however it worked on the xeon box:
33995     9.6859  postgres                 j2date
21925     6.2469  postgres                 ExecMakeFunctionResultNoSets
20500     5.8409  postgres                 slot_deform_tuple
17623     5.0212  postgres                 BitmapHeapNext
13059     3.7208  postgres                 dt2time
12271     3.4963  postgres                 slot_getattr
11509

aside from j2date (probably coming up due to that Join filter I'd
wager) nothing unexpected.


--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




Re: Xeon twice the performance of opteron

From
Greg Smith
Date:
On 03/17/2011 11:13 AM, Jeff wrote:
> three boxes:
>     A: Intel(R) Xeon(R) CPU           E5345  @ 2.33GHz  (Runs query
> fastest)
>         4MB cache
>     B: Quad-Core AMD Opteron(tm) Processor 2352 (2.1GHZ) (Main
> production box, currently, middle speed)
>         512k cache
>     C: Quad-Core AMD Opteron(tm) Processor 2378 (2.4GHZ)
>         512k cache

It's possible that transfer speed between the CPU and memory are very
different between these systems when running a single-core operation.
Intel often has an advantage there; I don't have any figures on this
generation of processors to know for sure though.  If you can get some
idle time to run my stream-scaling tool from
https://github.com/gregs1104/stream-scaling that might give you some
insight.

> Now here's where some odd stuff starts piling up: explain analyze
> overhead on said queries:
> 20ms on A, 50ms on B and 85ms on C(!!)

I found an example in my book where EXPLAIN ANALYZE took a trivial
COUNT(*) query from 8ms to 70ms.  It's really not cheap for some sorts
of things.

> I know we're running an old kernel, I'm tempted to upgrade to see what
> will happen, but at the same time I'm afraid it'll upgrade to a kernel
> with a broken [insert major subsystem here] which has happened before.

Running a production server on Fedora Core is a scary operation pretty
much all the time.  That said, I wouldn't consider 2.6.27 to be an old
kernel--not when RHEL5 is still using 2.6.18.  The kernel version you
get for FC10 is probably quite behind on updates, though, relative to a
kernel.org one that has kept getting bug fixes.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books