Thread: Promising results with Intel Linux x86 compiler
I've been playing around with Intel's x86 C++ compiler (icc) for Linux. The compiler is very good at optimizing x86 code. With some struggle, I managed to get postgresql compiled with it. I've listed below what I had to do to get postgres compiled, along with some results from pgbench.

Compilation

icc can compile C code as well as C++. WRT C, icc is binary compatible with gcc. It aims to be compatible with gcc extensions, but has a ways to go.

Note: I only compiled the backend with icc. Targets such as psql/pgbench were compiled with gcc.

* The first problem I encountered was that icc couldn't produce the SUBSYS.o targets for the backend. The workaround was to ignore the SUBSYS.o targets and link against each individual object file.

* Linking doesn't appear to work with icc's "-ipo" optimization. The goal of -ipo is to perform inlining of functions between source files. This is a bummer, since -ipo can produce very good code.

* Since icc can't handle inline assembly, a number of files need to be compiled with gcc. These include:
    access/transam/xlog.c
    storage/ipc/shmem.c
    storage/lmgr/proc.c
    storage/lmgr/lwlock.c
    storage/lmgr/s_lock.c
    utils/adt/pg_lzcompress.c
  Hopefully Intel will add inline assembly to icc....

* I used the gcc frontend to ld instead of using icc directly. ld seems to hang (3+ hrs cpu time, no output) when invoked from icc on postgresql for some reason.

Results:

I produced two different backend executables, one with gcc and one with icc. I ran each executable on the same database and benchmarked with pgbench.

(1) gcc 2.96 (yeah, RedHat 7.1) options:
    -Wall -O3 -fomit-frame-pointer -fforce-addr -fforce-mem
    -funroll-loops -malign-loops=2 -malign-functions=2 -malign-jumps=2
(2) icc 5.0 options:
    -O3 -tpp6 -xK -unroll -ip

pgbench{i}

          [-t 5000]  [-t 500 -c 10]  [-t 200 -c 25]  [-t 25 -c 50]
(1) gcc     94.06        94.74           86.77          149.38
(2) icc    102.08       100.31           91.10          155.15

{i}: results are tps excluding connection establishing, average of 3 runs. A full vacuum/analyze was performed between runs.

The results indicate up to a ~10% increase in transactions per second for pgbench. I've seen improvements more like 20% on some very cpu-intensive programs (e.g. the lame mp3 encoder).

If there are some other benchmarks that are easily run, let me know and I'll give them a go.

Side note: it seems difficult to get consistent results out of pgbench. I ended up dropping/recreating/repopulating the database between runs. I also modified pgbench to use a constant seed for the random number generator (attempting to get more consistent results).

Conclusion

The Intel compiler appears to produce better code than gcc 2.96 when testing with pgbench. My experience has been that icc excels at cpu-intensive processes, which might not be reflected in the pgbench results. Since postgresql can require lots of disk I/O, the gain over gcc will not be significant for processes that are already I/O bound.

The build process currently requires lots of hand tweaking and isn't entirely possible without gcc. Future versions of icc should improve upon this. As such, it may currently be impractical to give postgres support for icc out of the box, but it should be doable if there is interest.

My understanding is that the Intel evaluation license is ok for hobbyists, but without purchasing an actual license ($500) the compiled code cannot be distributed.

link: http://www.intel.com/software/products/compilers/c50/linux/noncom.htm

Regards,
Kyle
kaf@_nwlink_._com_
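For anyone who wants to check the arithmetic behind the "up to ~10%" figure, here is a small sketch that recomputes the relative gain for each pgbench configuration. The tps values are copied from the table in the post; the names `RESULTS` and `percent_gain` are made up for this illustration and are not part of pgbench.

```python
# tps figures copied from the pgbench table in the post
# (excluding connection establishing, average of 3 runs).
# Each entry maps the pgbench options to (gcc_tps, icc_tps).
RESULTS = {
    "-t 5000": (94.06, 102.08),
    "-t 500 -c 10": (94.74, 100.31),
    "-t 200 -c 25": (86.77, 91.10),
    "-t 25 -c 50": (149.38, 155.15),
}


def percent_gain(gcc_tps, icc_tps):
    """Relative throughput gain of the icc build over the gcc build, in percent."""
    return (icc_tps - gcc_tps) / gcc_tps * 100.0


if __name__ == "__main__":
    for options, (gcc_tps, icc_tps) in RESULTS.items():
        print(f"{options:<14} {percent_gain(gcc_tps, icc_tps):5.1f}%")
```

On these numbers the gain ranges from a little under 4% (-t 25 -c 50) to about 8.5% (-t 5000), so the single-client run is the one closest to the ~10% mark.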
Kyle <kaf@nwlink.com> writes:
> * Linking doesn't appear to work with icc's "-ipo" optimization. The
> goal of ipo to perform inlining of functions between source files.
> This is a bummer, since -ipo can produce very good code.

You should be quite wary of that one. The reason is that accesses to shared memory are typically protected by LWLockAcquire/LWLockRelease call pairs. It's absolutely critical that no operations get relocated into or out of the code segments between such call pairs. With interprocedural optimizations turned on, I think it's quite likely for a compiler to blow this --- which would lead to extremely nasty, low-probability, hard-to-debug failures during concurrent operation.

Having recently tracked down some similar nastiness *within* LWLockAcquire (AIX's compiler feels no compunction about rearranging volatile-object operations w.r.t. non-volatile ones) the prospect of any compiler deciding to interleave LWLockAcquire/LWLockRelease code with calling code scares me to death.

AFAIK the only way we could prevent such problems is for *all* pointers to shared memory to be marked volatile --- which would doubtless blow a good proportion of the speedup one might otherwise hope to get. Within an LWLockAcquire'd segment, shared memory is *not* volatile and we don't want to completely defeat optimization of routines such as the lock and buffer managers.

Possibly you could avoid the issue by arranging for lwlock.c to be compiled at a lower optimization level that doesn't expose its routines for merging with callers.

> Side note: it seems difficult to get consistent results out of
> pgbench.

Yeah, I've noticed that too. You really have to do a complete vacuum between runs to get any semblance of stable results.

regards, tom lane
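The per-file choices discussed so far (gcc for the files with inline assembly, icc elsewhere, and a lower optimization level for lwlock.c) could be written down as a small build helper. This is only a sketch under stated assumptions: the function name `compile_command`, the use of `-O1` as the "lower optimization level", and the idea of driving the build from a script are all hypothetical and not taken from the PostgreSQL makefiles. The file list and the gcc/icc flag sets are copied from Kyle's post.

```python
import os

# Files Kyle reported as needing gcc because they contain inline assembly.
GCC_ONLY = {
    "access/transam/xlog.c",
    "storage/ipc/shmem.c",
    "storage/lmgr/proc.c",
    "storage/lmgr/lwlock.c",
    "storage/lmgr/s_lock.c",
    "utils/adt/pg_lzcompress.c",
}

# Assumption: only lwlock.c is held back, following Tom's suggestion.
LOW_OPT = {"storage/lmgr/lwlock.c"}

# Flag sets copied from the original post.
GCC_FLAGS = [
    "-Wall", "-O3", "-fomit-frame-pointer", "-fforce-addr", "-fforce-mem",
    "-funroll-loops", "-malign-loops=2", "-malign-functions=2",
    "-malign-jumps=2",
]
ICC_FLAGS = ["-O3", "-tpp6", "-xK", "-unroll", "-ip"]

# Assumption: -O1 stands in for "a lower optimization level".
LOW_OPT_FLAGS = ["-Wall", "-O1"]


def compile_command(src):
    """Return the argv list for compiling one backend source file."""
    if src in LOW_OPT:
        compiler, flags = "gcc", LOW_OPT_FLAGS
    elif src in GCC_ONLY:
        compiler, flags = "gcc", GCC_FLAGS
    else:
        compiler, flags = "icc", ICC_FLAGS
    obj = os.path.splitext(src)[0] + ".o"
    return [compiler, *flags, "-c", src, "-o", obj]
```

For example, `compile_command("storage/lmgr/lwlock.c")` yields a gcc command line with `-O1`, while a file outside the list (say `utils/adt/float.c`, used here purely as an example path) gets the icc flags from the post.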
Hi Kyle,

Would you like to try the icc-optimised version of PostgreSQL with the OSDB (Open Source Database Benchmark)? It's based on the AS3AP database benchmark, which I feel is a lot more recognised than pgbench.

Its URL is http://osdb.sourceforge.net

The latest released version (0.12) has a problem with hash indexes in PostgreSQL (a PostgreSQL bug which Neil Conway has put up his hand to fix), but the latest CVS commit of OSDB has a workaround for that.

*If* you don't mind downloading the latest CVS version (it's not a real big program) and compiling that, it would be interesting to see the throughput differences between the gcc-compiled and icc-compiled versions of PostgreSQL.

If you need the dataset generation utility for OSDB, I have that too. Just ask me for it and I'll email it to you. It's a DOS executable, but runs fine with Wine (the Windows emulator).

:-)

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
   - Indira Gandhi