Thread: OSX & Performance
Well folks, I've been trying to track down why this Athlon 2800 (2.1ghz) has been handing my 2.5ghz G5 its cake. I have a query that (makes no io - the dataset can live in ram easily) takes about 700ms on the athlon and about 10 seconds on the G5. Tracking ti down a bit timestamp_cmp_internal (The btree was made of a timestamp & and int) was taking a large amount of time - specifically all the calls it makes to isnan(x). 14.1% in __isnand (which is the libSystem function & guts, which according to the darwin source copies the double to memory and accesses it as 2 ints looking for a specific pattern). (For reference, the other top functions are _bt_checkkeys at 30%, FunctionCall2 at 15.8% , _bt_step at 9% and _bt_first at 7%) . Talking to some of the mac super guru's on irc they said the problem is how the Mach-O ABI works, basically you get kicked in the nuts for accessing global or static data (like those constants __isnand uses). (You can read http://www.unsanity.org/archives/000044.php for a touch of info on it). I think given the function-call-rich arch of PG may make its performance on OSX always lower than other counterparts. Especially things like that __isnand. I'm going to be doing a couple experiments: 1. making an inline version of isnan to see how that improves performance 2. Trying it out on linux ppc to see how it runs. It may be worth noting these in the docs or faq somewhere. Also, two things to note, one of which is quite important: On tiger (10.4) PG compiles with NO OPTIMIZATION. Probably a template file needs to be updated. Panther seems to compile with -O2 though. If you want to profile PG on Tiger do not use gprof - it seems to be broken. I get func call #s, but no timing data. Instead you can do something even better - compile PG normally and attach to it with Shark (Comes with the CHUD tools) and check out its profile. Quite slick actually :) I'll keep people updated on my progress, but I just wanted to get these issues out in the air. -- Jeff Trout <jeff@jefftrout.com> http://www.jefftrout.com/ http://www.stuarthamm.net/
Jeff Trout <threshar@torgo.978.org> writes: > Tracking ti down a bit timestamp_cmp_internal (The btree was made of > a timestamp & and int) was taking a large amount of time - > specifically all the calls it makes to isnan(x). 14.1% in __isnand Hmm, can you provide a test case for other people to poke at? > Also, two things to note, one of which is quite important: On tiger > (10.4) PG compiles with NO OPTIMIZATION. Probably a template file > needs to be updated. > Panther seems to compile with -O2 though. I see -O2 when building PG (CVS tip) on a fully up-to-date 10.4.2 machine. Maybe something odd in your environment, like a preset CFLAGS setting? regards, tom lane
On Aug 28, 2005, at 4:00 PM, Tom Lane wrote: > > Hmm, can you provide a test case for other people to poke at? > I'l try to put one together as small as I can make it. The table in question is roughly 22M rows. There are about 8k rows per timestamp (day granularity). > I see -O2 when building PG (CVS tip) on a fully up-to-date 10.4.2 > machine. Maybe something odd in your environment, like a preset > CFLAGS setting? > 8.0.3 doesn't have any optimization flags 8.1beta1 doesn't have any optimization ie: gcc -no-cpp-precomp -Wall -Wmissing-prototypes -Wpointer-arith - Wdeclaration-after-statement -Wold-style-definition -Wendif-labels - fno-strict-aliasing -I../../src/port -I../../src/include -c thread.c -o thread_srv.o I'm on 10.4.2, xcode 2.1 Using built-in specs. Target: powerpc-apple-darwin8 Configured with: /private/var/tmp/gcc/gcc-5026.obj~19/src/configure -- disable-checking --prefix=/usr --mandir=/share/man --enable- languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^+.-]*$/ s/$/-4.0/ --with-gxx-include-dir=/include/gcc/darwin/4.0/c++ -- build=powerpc-apple-darwin8 --host=powerpc-apple-darwin8 -- target=powerpc-apple-darwin8 Thread model: posix gcc version 4.0.0 (Apple Computer, Inc. build 5026) The snapshot on ftp.psotgresql.org (dated 8/29) also runs with no optimization. No cflags are set. need to see anything from config.log? -- Jeff Trout <jeff@jefftrout.com> http://www.jefftrout.com/ http://www.stuarthamm.net/
Jeff Trout <threshar@torgo.978.org> writes: > On Aug 28, 2005, at 4:00 PM, Tom Lane wrote: >> I see -O2 when building PG (CVS tip) on a fully up-to-date 10.4.2 >> machine. Maybe something odd in your environment, like a preset >> CFLAGS setting? > 8.0.3 doesn't have any optimization flags > 8.1beta1 doesn't have any optimization > ie: gcc -no-cpp-precomp -Wall -Wmissing-prototypes -Wpointer-arith - > Wdeclaration-after-statement -Wold-style-definition -Wendif-labels - > fno-strict-aliasing -I../../src/port -I../../src/include -c > thread.c -o thread_srv.o You must have CFLAGS set to empty in your build environment, because configure will certainly default to -O2 if not overridden. It works fine for me on OS X. Maybe you want to trace through the configure script and see why it's doing something else? regards, tom lane
On Aug 29, 2005, at 1:57 PM, Tom Lane wrote: > > You must have CFLAGS set to empty in your build environment, because > configure will certainly default to -O2 if not overridden. It works > fine for me on OS X. Maybe you want to trace through the configure > script and see why it's doing something else? > /me hangs head in shame. Yes. I'd been futzing with various settings and had CFLAGS set to empty instead of cleared out. 8.0.3 and -snapshot (8/29) both seem to now compile with -O2 Anyway, I tried putting together a nice self-data-producing test case but that didn't cause the bug. So I'm trying to get this dump as small as possible (I'll email you a url later). To tide things over, here's the gprof (and shark) output for my query of doom. linux box: 6.36 0.41 0.41 240694 0.00 0.00 _bt_compare 5.97 0.79 0.38 907242 0.00 0.00 AllocSetAlloc 4.55 1.07 0.29 135008 0.00 0.00 hash_any 4.16 1.34 0.27 185684 0.00 0.00 MemoryContextAllocZeroAlig ned 3.30 1.55 0.21 39152 0.01 0.01 localsub 2.98 1.74 0.19 1213172 0.00 0.00 AllocSetFreeIndex 2.83 1.92 0.18 52695 0.00 0.00 nocachegetattr 2.75 2.10 0.17 134775 0.00 0.00 hash_search 2.51 2.25 0.16 47646 0.00 0.01 StrategyBufferLookup 2.28 2.40 0.14 71990 0.00 0.00 fmgr_isbuiltin 2.20 2.54 0.14 33209 0.00 0.00 _bt_moveright 1.88 2.66 0.12 78864 0.00 0.00 comparetup_heap 1.57 2.76 0.10 63485 0.00 0.00 SearchCatCache 1.41 2.85 0.09 39152 0.00 0.00 timesub 1.26 2.93 0.08 325246 0.00 0.00 tas 1.26 3.01 0.08 305883 0.00 0.00 AllocSetFree 1.26 3.09 0.08 162622 0.00 0.00 LWLockAcquire and on osx: (self, total, library, func) 29.0% 29.0% postmaster _bt_checkkeys 15.6% 15.6% postmaster FunctionCall2 10.4% 10.4% libSystem.B.dylib __isnand 9.5% 9.5% postmaster timestamp_cmp_internal 9.3% 9.3% postmaster _bt_step 5.3% 5.3% postmaster timestamp_le 4.9% 4.9% postmaster _bt_next 3.6% 3.6% postmaster dyld_stub___isnand 3.1% 3.1% postmaster timestamp_gt 1.9% 1.9% postmaster int4eq 1.3% 1.3% postmaster BufferGetBlockNumber 0.6% 0.6% postmaster LWLockAcquire 0.5% 0.5% postmaster LWLockRelease 0.4% 0.4% postmaster hash_search On my failed simulated attempt here's what things looked liek (the data should have been relatively similar). linux: 5.39 0.28 0.28 852086 0.00 0.00 AllocSetAlloc 4.90 0.53 0.25 130165 0.00 0.00 hash_any 4.12 0.73 0.21 214061 0.00 0.00 _bt_compare 4.12 0.94 0.21 39152 0.01 0.01 localsub 4.02 1.15 0.20 160487 0.00 0.00 MemoryContextAllocZeroAlig ned 3.24 1.31 0.17 1157316 0.00 0.00 AllocSetFreeIndex 3.14 1.48 0.16 64375 0.00 0.00 fmgr_isbuiltin 2.55 1.60 0.13 56142 0.00 0.00 SearchCatCache 2.35 1.73 0.12 130076 0.00 0.00 hash_search 1.76 1.81 0.09 39152 0.00 0.00 timesub 1.67 1.90 0.09 221469 0.00 0.00 timestamp_cmp_internal 1.67 1.99 0.09 56069 0.00 0.00 MemoryContextCreate 1.57 2.06 0.08 145787 0.00 0.00 LWLockRelease 1.37 2.13 0.07 289119 0.00 0.00 pfree 1.37 2.21 0.07 8002 0.01 0.02 ExecMakeFunctionResult 1.37 2.27 0.07 8000 0.01 0.22 ExecInitIndexScan 1.18 2.33 0.06 291574 0.00 0.00 tas and on osx: (which runs very fast, usually a couple hundred ms faster than the linux box) 5.9% 5.9% postmaster LWLockAcquire 5.2% 5.2% postmaster AllocSetAlloc 4.9% 4.9% postmaster LWLockRelease 3.9% 3.9% postmaster hash_any 3.6% 3.6% postmaster _bt_compare 2.9% 2.9% postmaster hash_search 2.6% 2.6% postmaster MemoryContextAllocZeroAligned 2.6% 2.6% postmaster ExecInitExpr 2.0% 2.0% mach_kernel ml_set_interrupts_enabled 2.0% 2.0% postmaster fmgr_info_cxt_security 2.0% 2.0% postmaster AllocSetFree 1.6% 1.6% postmaster MemoryContextAlloc 1.6% 1.6% postmaster FunctionCall2 1.6% 1.6% postmaster AllocSetDelete 1.6% 1.6% libSystem.B.dylib __isnand which to me anyway, looks like basically the same profile. So there must be something about the exact nature of hte data that is kicking it in the nuts. I tried making a copy of hte table using select into, I get the same performace. Clustered on the index.. same hting. The table is a timestamp (no tz), 2 ints and 4 doubles. The index is on (timestamp, int1) As I said before, I'll send a url along to the dump once it has dumped and I get it somewhere good (unless I get my test data generator to invoke this problem). I could also get you access to this machine, but be warned gprof on tiger is pretty useless from what I've seen. -- Jeff Trout <jeff@jefftrout.com> http://www.jefftrout.com/ http://www.stuarthamm.net/