Thread: OSX & Performance

OSX & Performance

From
Jeff Trout
Date:
Well folks, I've been trying to track down why this Athlon 2800
(2.1ghz) has been handing my 2.5ghz G5 its cake.  I have a query that
(makes no io - the dataset can live in ram easily) takes about 700ms
on the athlon and about 10 seconds on the G5.

Tracking ti down a bit timestamp_cmp_internal (The btree was made of
a timestamp & and int) was taking a large amount of time -
specifically all the calls it makes to isnan(x).  14.1% in __isnand
(which is the libSystem function & guts, which according to the
darwin source copies the double to memory and accesses it as 2 ints
looking for a specific pattern).  (For reference, the other top
functions are _bt_checkkeys at 30%, FunctionCall2 at 15.8% , _bt_step
at 9% and _bt_first at 7%) .

Talking to some of the mac super guru's on irc they said the problem
is how the Mach-O ABI works, basically you get kicked in the nuts for
accessing global or static data (like those constants __isnand
uses).  (You can read http://www.unsanity.org/archives/000044.php for
a touch of info on it).

I think given the function-call-rich arch of PG may make its
performance on OSX always lower than other counterparts. Especially
things like that __isnand.

I'm going to be doing a couple experiments: 1. making an inline
version of isnan to see how that improves performance  2. Trying it
out on linux ppc to see how it runs.   It may be worth noting these
in the docs or faq somewhere.

Also, two things to note, one of which is quite important: On tiger
(10.4) PG compiles with NO OPTIMIZATION.  Probably a template file
needs to be updated.
Panther seems to compile with -O2 though.

If you want to profile PG on Tiger do not use gprof - it seems to be
broken.  I get func call #s, but no timing data.  Instead you can do
something even better - compile PG normally and attach to it with
Shark (Comes with the CHUD tools) and check out its profile.  Quite
slick actually :)

I'll keep people updated on my progress, but I just wanted to get
these issues out in the air.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/



Re: OSX & Performance

From
Tom Lane
Date:
Jeff Trout <threshar@torgo.978.org> writes:
> Tracking ti down a bit timestamp_cmp_internal (The btree was made of
> a timestamp & and int) was taking a large amount of time -
> specifically all the calls it makes to isnan(x).  14.1% in __isnand

Hmm, can you provide a test case for other people to poke at?

> Also, two things to note, one of which is quite important: On tiger
> (10.4) PG compiles with NO OPTIMIZATION.  Probably a template file
> needs to be updated.
> Panther seems to compile with -O2 though.

I see -O2 when building PG (CVS tip) on a fully up-to-date 10.4.2
machine.  Maybe something odd in your environment, like a preset
CFLAGS setting?

            regards, tom lane

Re: OSX & Performance

From
Jeff Trout
Date:
On Aug 28, 2005, at 4:00 PM, Tom Lane wrote:

>
> Hmm, can you provide a test case for other people to poke at?
>

I'l try to put one together as small as I can make it.
The table in question is roughly 22M rows.  There are about 8k rows
per timestamp (day granularity).

> I see -O2 when building PG (CVS tip) on a fully up-to-date 10.4.2
> machine.  Maybe something odd in your environment, like a preset
> CFLAGS setting?
>

8.0.3 doesn't have any optimization flags
8.1beta1 doesn't have any optimization
ie: gcc -no-cpp-precomp -Wall -Wmissing-prototypes -Wpointer-arith -
Wdeclaration-after-statement -Wold-style-definition -Wendif-labels -
fno-strict-aliasing  -I../../src/port  -I../../src/include  -c
thread.c -o thread_srv.o

I'm on 10.4.2, xcode 2.1
Using built-in specs.
Target: powerpc-apple-darwin8
Configured with: /private/var/tmp/gcc/gcc-5026.obj~19/src/configure --
disable-checking --prefix=/usr --mandir=/share/man --enable-
languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^+.-]*$/
s/$/-4.0/ --with-gxx-include-dir=/include/gcc/darwin/4.0/c++ --
build=powerpc-apple-darwin8 --host=powerpc-apple-darwin8 --
target=powerpc-apple-darwin8
Thread model: posix
gcc version 4.0.0 (Apple Computer, Inc. build 5026)

The snapshot on ftp.psotgresql.org (dated 8/29) also runs with no
optimization.

No cflags are set.

need to see anything from config.log?

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/



Re: OSX & Performance

From
Tom Lane
Date:
Jeff Trout <threshar@torgo.978.org> writes:
> On Aug 28, 2005, at 4:00 PM, Tom Lane wrote:
>> I see -O2 when building PG (CVS tip) on a fully up-to-date 10.4.2
>> machine.  Maybe something odd in your environment, like a preset
>> CFLAGS setting?

> 8.0.3 doesn't have any optimization flags
> 8.1beta1 doesn't have any optimization
> ie: gcc -no-cpp-precomp -Wall -Wmissing-prototypes -Wpointer-arith -
> Wdeclaration-after-statement -Wold-style-definition -Wendif-labels -
> fno-strict-aliasing  -I../../src/port  -I../../src/include  -c
> thread.c -o thread_srv.o

You must have CFLAGS set to empty in your build environment, because
configure will certainly default to -O2 if not overridden.  It works
fine for me on OS X.  Maybe you want to trace through the configure
script and see why it's doing something else?

            regards, tom lane

Re: OSX & Performance

From
Jeff Trout
Date:
On Aug 29, 2005, at 1:57 PM, Tom Lane wrote:
>
> You must have CFLAGS set to empty in your build environment, because
> configure will certainly default to -O2 if not overridden.  It works
> fine for me on OS X.  Maybe you want to trace through the configure
> script and see why it's doing something else?
>

/me hangs head in shame.

Yes. I'd been futzing with various settings and had CFLAGS set to
empty instead of cleared out.   8.0.3 and -snapshot (8/29) both seem
to now compile with -O2

Anyway, I tried putting together a nice self-data-producing test case
but that didn't cause the bug.  So I'm trying to get this dump as
small as possible (I'll email you a url later).

To tide things over, here's the gprof (and shark) output for my query
of doom.

linux box:

   6.36      0.41     0.41   240694     0.00     0.00  _bt_compare
   5.97      0.79     0.38   907242     0.00     0.00  AllocSetAlloc
   4.55      1.07     0.29   135008     0.00     0.00  hash_any
   4.16      1.34     0.27   185684     0.00     0.00
MemoryContextAllocZeroAlig
ned
   3.30      1.55     0.21    39152     0.01     0.01  localsub
   2.98      1.74     0.19  1213172     0.00     0.00  AllocSetFreeIndex
   2.83      1.92     0.18    52695     0.00     0.00  nocachegetattr
   2.75      2.10     0.17   134775     0.00     0.00  hash_search
   2.51      2.25     0.16    47646     0.00     0.01
StrategyBufferLookup
   2.28      2.40     0.14    71990     0.00     0.00  fmgr_isbuiltin
   2.20      2.54     0.14    33209     0.00     0.00  _bt_moveright
   1.88      2.66     0.12    78864     0.00     0.00  comparetup_heap
   1.57      2.76     0.10    63485     0.00     0.00  SearchCatCache
   1.41      2.85     0.09    39152     0.00     0.00  timesub
   1.26      2.93     0.08   325246     0.00     0.00  tas
   1.26      3.01     0.08   305883     0.00     0.00  AllocSetFree
   1.26      3.09     0.08   162622     0.00     0.00  LWLockAcquire

and on osx: (self, total, library, func)

     29.0%    29.0%    postmaster    _bt_checkkeys
     15.6%    15.6%    postmaster    FunctionCall2
     10.4%    10.4%    libSystem.B.dylib    __isnand
     9.5%    9.5%    postmaster    timestamp_cmp_internal
     9.3%    9.3%    postmaster    _bt_step
     5.3%    5.3%    postmaster    timestamp_le
     4.9%    4.9%    postmaster    _bt_next
     3.6%    3.6%    postmaster    dyld_stub___isnand
     3.1%    3.1%    postmaster    timestamp_gt
     1.9%    1.9%    postmaster    int4eq
     1.3%    1.3%    postmaster    BufferGetBlockNumber
     0.6%    0.6%    postmaster    LWLockAcquire
     0.5%    0.5%    postmaster    LWLockRelease
     0.4%    0.4%    postmaster    hash_search

On my failed simulated attempt here's what things looked liek (the
data should have been relatively similar).

linux:

   5.39      0.28     0.28   852086     0.00     0.00  AllocSetAlloc
   4.90      0.53     0.25   130165     0.00     0.00  hash_any
   4.12      0.73     0.21   214061     0.00     0.00  _bt_compare
   4.12      0.94     0.21    39152     0.01     0.01  localsub
   4.02      1.15     0.20   160487     0.00     0.00
MemoryContextAllocZeroAlig
ned
   3.24      1.31     0.17  1157316     0.00     0.00  AllocSetFreeIndex
   3.14      1.48     0.16    64375     0.00     0.00  fmgr_isbuiltin
   2.55      1.60     0.13    56142     0.00     0.00  SearchCatCache
   2.35      1.73     0.12   130076     0.00     0.00  hash_search
   1.76      1.81     0.09    39152     0.00     0.00  timesub
   1.67      1.90     0.09   221469     0.00     0.00
timestamp_cmp_internal
   1.67      1.99     0.09    56069     0.00     0.00
MemoryContextCreate
   1.57      2.06     0.08   145787     0.00     0.00  LWLockRelease
   1.37      2.13     0.07   289119     0.00     0.00  pfree
   1.37      2.21     0.07     8002     0.01     0.02
ExecMakeFunctionResult
   1.37      2.27     0.07     8000     0.01     0.22  ExecInitIndexScan
   1.18      2.33     0.06   291574     0.00     0.00  tas

and on osx: (which runs very fast, usually a couple hundred ms faster
than the linux box)

     5.9%    5.9%    postmaster    LWLockAcquire
     5.2%    5.2%    postmaster    AllocSetAlloc
     4.9%    4.9%    postmaster    LWLockRelease
     3.9%    3.9%    postmaster    hash_any
     3.6%    3.6%    postmaster    _bt_compare
     2.9%    2.9%    postmaster    hash_search
     2.6%    2.6%    postmaster    MemoryContextAllocZeroAligned
     2.6%    2.6%    postmaster    ExecInitExpr
     2.0%    2.0%    mach_kernel    ml_set_interrupts_enabled
     2.0%    2.0%    postmaster    fmgr_info_cxt_security
     2.0%    2.0%    postmaster    AllocSetFree
     1.6%    1.6%    postmaster    MemoryContextAlloc
     1.6%    1.6%    postmaster    FunctionCall2
     1.6%    1.6%    postmaster    AllocSetDelete
     1.6%    1.6%    libSystem.B.dylib    __isnand

which to me anyway, looks like basically the same profile.
So there must be something about the exact nature of hte data that is
kicking it in the nuts.

I tried making a copy of hte table using select into,  I get the same
performace. Clustered on the index.. same hting.

The table is a timestamp (no tz), 2 ints and 4 doubles.  The index is
on (timestamp, int1)

As I said before, I'll send a url along to the dump once it has
dumped and I get it somewhere good (unless I get my test data
generator to invoke this problem).  I could also get you access to
this machine, but be warned gprof on tiger is pretty useless from
what I've seen.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/