Thread: New to PostgreSQL, performance considerations
Hi y'all, although I've worked with databases for more than 7 years now, I'm pretty new to PostgreSQL. I have an application using SQLite3 as an embedded SQL solution because it's simple and it can handle the load that *most* of my clients have. Because of that '*most*' part, because of the client/server model, and because of the license, I'm thinking about starting to use PostgreSQL.

My app uses only three tables: one has low read and really high write rates, a second has high read and low write, and the third is equally high on both. I need a db that can handle something like 500 operations/sec continuously - roughly 250 writes/sec and 250 reads/sec. My databases use indexes.

Each table would have to handle 5 million rows/day. So I'm thinking about creating different tables (clusters?) for different days to make queries return faster. Am I right, or is there no problem in having 150 million rows (one month) in a table? All my data is e-mail traffic: users' quarantine, inbound traffic, outbound traffic, senders, recipients, subjects, attachments, etc...

What do you people say, is it possible with PostgreSQL? What kind of hardware would I need to handle that kind of traffic? In a first test, on a badly tuned AMD Athlon XP 1800+ (ergh!) I could do 1400 writes/sec locally after I disabled fsync. We have UPSs; in the last year we only had one power failure.

Thank you all for your tips.

Best regards,
Daniel Colchete
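For scale, the arithmetic behind those numbers: 5 million rows/day is 5,000,000 / 86,400 ≈ 58 inserts/sec per table on average, or roughly 175/sec across all three tables, so the 250 writes/sec target presumably covers peak-hour traffic rather than the daily average.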
Daniel van Ham Colchete wrote:
> although I've worked with databases for more than 7 years now, I'm
> pretty new to PostgreSQL.

Same here.

> I need a db that can handle something like 500 operations/sec
> continuously. It's something like 250 writes/sec and 250 reads/sec. My
> databases use indexes.

Taken from an email to the admin list about a week ago -

Stats about the system:
Postgres 8.1.4
db size: 200+ GB
Inheritance is used extremely heavily, so in figuring out what could cause a create to hang, it may be of interest to know that there are:
101,745 tables
314,821 indexes
1,569 views
The last averages taken on the number of writes per hour on this database: ~3 million (this stat is a few weeks old)

Machine info:
OS: Solaris 10
Sunfire X4100 XL
2x AMD Opteron Model 275 dual core procs
8GB of ram

> Each table would have to handle 5 million rows/day. So I'm thinking
> about creating different tables (clusters?) for different days to make
> queries return faster. Am I right, or is there no problem in having
> 150 million rows (one month) in a table?

Sounds to me like a month might be on the large side for really fast response times - I would think of separating it weekly rather than daily.

Start with http://www.postgresql.org/docs/8.2/interactive/ddl-inherit.html then the next chapter explains using that to partition data into different tables depending on specified criteria.

You may be interested in tsearch2, which is in the contrib dir and adds full text indexing.

--
Shane Ambler
pgSQL@007Marketing.com

Get Sheeky @ http://Sheeky.Biz
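For readers following along, here is a minimal sketch of the inheritance-based partitioning Shane points to, in 8.1/8.2-era syntax. The table and column names (mail_log, log_time, and the weekly children) are invented for illustration, not taken from this thread; the actual ranges and partition size are exactly what would need to be tested.

-- Hypothetical parent table (normally kept empty); children inherit its columns.
CREATE TABLE mail_log (
    log_time  timestamp NOT NULL,
    sender    text,
    recipient text,
    subject   text
);

-- One child table per week, each with a CHECK constraint describing its slice.
CREATE TABLE mail_log_2006_w50 (
    CHECK (log_time >= '2006-12-11' AND log_time < '2006-12-18')
) INHERITS (mail_log);

CREATE TABLE mail_log_2006_w51 (
    CHECK (log_time >= '2006-12-18' AND log_time < '2006-12-25')
) INHERITS (mail_log);

-- Index each child on the partitioning column.
CREATE INDEX mail_log_2006_w50_time_idx ON mail_log_2006_w50 (log_time);
CREATE INDEX mail_log_2006_w51_time_idx ON mail_log_2006_w51 (log_time);

-- With constraint exclusion enabled, a query that filters on log_time only
-- scans the children whose CHECK constraints can possibly match.
SET constraint_exclusion = on;
SELECT count(*) FROM mail_log WHERE log_time >= '2006-12-18';

A nice side effect is that expiring old data becomes a cheap DROP TABLE of the oldest child instead of a huge DELETE plus vacuum.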
Hi Gene,

in my postgresql.conf, the only non-commented lines are:

fsync = off
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
max_connections = 100
shared_buffers = 5000
temp_buffers = 1000
work_mem = 4096

The only two values I changed are shared_buffers and work_mem.

*** BUT ***

I'm using Gentoo Linux, so all my libraries (including glibc, which is very important to PostgreSQL) and all my software are compiled with good CFLAGS for my processor ("-O2 -march=athlon-xp (...)"). My Linux is not an Intel-AMD binary-compatible turtle like Fedora/RedHat/SUSE/... It's really important to have your glibc compiled for your processor. It is essential for performance.

I can't imagine anyone buying a $1k quad-core XEON and using an i586-compatible distro that doesn't even know what the fudge SSE/SSE2/vectorized instructions are.

Best regards,
Daniel Colchete

On 12/10/06, Gene <genekhart@gmail.com> wrote:
> I have a similar type application, I'm partitioning using constraint
> exclusion so queries only have to look at a few tables. I've found that
> there is some overhead to using partitioning so you should test to see how
> many partitions you want to create. Could I check out your postgresql.conf
> parameters to compare? thanks
>
> Gene Hart
Shane, really good answer, man!!! It really helped me! I can't wait to have time to test this partitioning thing. It will really solve a lot of my problems (and another one that I was putting off). Instead of partitioning by days/weeks I can tune it down to the second (timestamp) and try to find the best partition size for performance. Partitioning is really good.

That server you mentioned handles about 15 times more writes than I would need.

Thank you very much for your answer.

Best regards,
Daniel Colchete

On 12/10/06, Shane Ambler <pgsql@007marketing.com> wrote:
> Daniel van Ham Colchete wrote:
> > although I've worked with databases for more than 7 years now, I'm
> > pretty new to PostgreSQL.
>
> Same here.
>
> > I need a db that can handle something like 500 operations/sec
> > continuously. It's something like 250 writes/sec and 250 reads/sec. My
> > databases use indexes.
>
> Taken from an email to the admin list about a week ago -
>
> Stats about the system:
> Postgres 8.1.4
> db size: 200+ GB
> Inheritance is used extremely heavily, so in figuring out what could
> cause a create to hang, it may be of interest to know that there are:
> 101,745 tables
> 314,821 indexes
> 1,569 views
> The last averages taken on the number of writes per hour on this
> database: ~3 million (this stat is a few weeks old)
>
> Machine info:
> OS: Solaris 10
> Sunfire X4100 XL
> 2x AMD Opteron Model 275 dual core procs
> 8GB of ram
>
> > Each table would have to handle 5 million rows/day. So I'm thinking
> > about creating different tables (clusters?) for different days to make
> > queries return faster. Am I right, or is there no problem in having
> > 150 million rows (one month) in a table?
>
> Sounds to me like a month might be on the large side for really fast
> response times - I would think of separating it weekly rather than daily.
>
> Start with
> http://www.postgresql.org/docs/8.2/interactive/ddl-inherit.html
> then the next chapter explains using that to partition data into
> different tables depending on specified criteria.
>
> You may be interested in tsearch2, which is in the contrib dir and adds
> full text indexing.
>
> --
> Shane Ambler
> pgSQL@007Marketing.com
>
> Get Sheeky @ http://Sheeky.Biz
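If Daniel does go with timestamp-range partitions, the other half of the setup is routing incoming rows to the right child table, which is exactly the overhead Gene mentions in the next message. A minimal sketch using a rewrite rule, reusing the hypothetical mail_log tables from the earlier sketch (the 8.2 partitioning docs show this same pattern; a trigger can be used instead):

-- Hypothetical routing rule: INSERTs aimed at the parent are redirected to
-- the child whose range matches. One rule (or trigger branch) per partition.
CREATE RULE mail_log_insert_2006_w50 AS
ON INSERT TO mail_log
WHERE (NEW.log_time >= '2006-12-11' AND NEW.log_time < '2006-12-18')
DO INSTEAD
    INSERT INTO mail_log_2006_w50 VALUES (NEW.*);

Each new partition means another rule or trigger branch to create and eventually drop, which is one reason to test how many partitions (daily vs. weekly) are actually worth having before committing to a scheme.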
I'm using Gentoo as well. I'm having performance issues as the number of partitions increases, I imagine due to the overhead of managing them and figuring out where to put each insert/update, so I'm switching to weekly partitions instead of daily. I believe in PG 8.2 constraint exclusion works with updates/deletes also, so I'm eager to upgrade. I get about 1 million records per day in each of two tables, with each record updated about 4 times within 30 minutes.
Do you think using UTF8 vs US-ASCII hurts performance significantly? Some of my smaller tables require Unicode, and I don't think you can have some tables be Unicode and some be ASCII.
On 12/10/06, Daniel van Ham Colchete <daniel.colchete@gmail.com> wrote:
Hi Gene,
in my postgresql.conf, the only non-commented lines are:
fsync = off
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
max_connections = 100
shared_buffers = 5000
temp_buffers = 1000
work_mem = 4096
The only two values I changed are shared_buffers and work_mem.
*** BUT ***
I'm using Gentoo Linux, so all my libraries (including glibc, which is
very important to PostgreSQL) and all my software are compiled with
good CFLAGS for my processor ("-O2 -march=athlon-xp (...)"). My
Linux is not an Intel-AMD binary-compatible turtle like
Fedora/RedHat/SUSE/... It's really important to have your glibc
compiled for your processor. It is essential for performance.
I can't imagine anyone buying a $1k quad-core XEON and using an
i586-compatible distro that doesn't even know what the fudge
SSE/SSE2/vectorized instructions are.
Best regards,
Daniel Colchete
On 12/10/06, Gene <genekhart@gmail.com> wrote:
> I have a similar type application, I'm partitioning using constraint
> exclusion so queries only have to look at a few tables. I've found that
> there is some overhead to using partitioning so you should test to see how
> many partitions you want to create. Could I check out your postgresql.conf
> parameters to compare? thanks
>
>
> Gene Hart
--
Gene Hart
cell: 443-604-2679
Gene, the Postgres docs say that the "constraint checks are relatively expensive". From what you're saying, it's really worth studying the matter deeply first.

I never understood what the difference between the ASCII/ISO-8859-1/UTF8 charsets means to a database. They're all simple C strings that don't have a zero byte in the middle (like UTF16 would) and that don't require any different processing unless you are doing case-insensitive search (then you would have a problem). ASCII chars are also valid UTF8 chars; the first 128 Unicode code points are the same as the ASCII chars. So you would not have any problems changing your table from ASCII to UTF8.

My software uses UTF16 and UTF8 in some of its internals and I only notice performance problems with UTF16 (because of the zero-byte thing, the processing I do is different). So I imagine that you wouldn't have any performance issues changing from ASCII to UTF8 if necessary.

Nowadays everything is turning to Unicode (thank god). I wouldn't start anything with any other charset - I would only be asking for a rewrite in the near future.

Best,
Daniel

On 12/10/06, Gene <genekhart@gmail.com> wrote:
> I'm using Gentoo as well. I'm having performance issues as the number of
> partitions increases, I imagine due to the overhead of managing them and
> figuring out where to put each insert/update, so I'm switching to weekly
> partitions instead of daily. I believe in PG 8.2 constraint exclusion works
> with updates/deletes also, so I'm eager to upgrade. I get about 1 million
> records per day in each of two tables, with each record updated about 4
> times within 30 minutes.
>
> Do you think using UTF8 vs US-ASCII hurts performance significantly? Some
> of my smaller tables require Unicode, and I don't think you can have some
> tables be Unicode and some be ASCII.
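One small way to see the character-vs-byte distinction Daniel is glossing over (and that later replies expand on) is to compare character length with byte length in a UTF8-encoded database. This is just an illustration, not a measurement from anyone's benchmark:

-- In a UTF8 database a character can occupy more than one byte, so
-- length() (characters) and octet_length() (bytes) can differ.
SELECT length('café') AS characters, octet_length('café') AS bytes;
-- characters = 4, bytes = 5: the accented character takes two bytes,
-- which is why the server's text operations have to be encoding-aware.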
On Dec 11, 2006, at 10:47, Daniel van Ham Colchete wrote:

> I never understood what the difference between the ASCII/ISO-8859-1/UTF8
> charsets means to a database.

If what you mean by ASCII is SQL_ASCII, then there is at least one significant difference between UTF8 (the PostgreSQL encoding) and SQL_ASCII. AIUI, SQL_ASCII does no checking at all with respect to what bytes are going in, while UTF8 does make sure, to the best of its ability, that the bytes represent valid UTF-8 characters, throwing an error if an invalid byte sequence is detected.

There's more information regarding this here:
http://www.postgresql.org/docs/8.2/interactive/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

Michael Glaesemann
grzm seespotcode net
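Regarding Gene's earlier question about mixing encodings: in PostgreSQL the encoding is a property of the whole database, chosen at CREATE DATABASE time, not of individual tables. A quick, generic way to check what is in use (the database name below is made up for the example):

-- Encoding is per database, not per table; list what each database uses.
SELECT datname, pg_encoding_to_char(encoding) AS encoding
FROM pg_database;

-- A new database can be created with an explicit encoding:
CREATE DATABASE maillog_utf8 ENCODING 'UTF8';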
On Sun, Dec 10, 2006 at 11:02:44PM -0200, Daniel van Ham Colchete wrote:
>I'm using Gentoo Linux, so all my libraries (including glibc, which is
>very important to PostgreSQL) and all my software are compiled with
>good CFLAGS for my processor ("-O2 -march=athlon-xp (...)"). My
>Linux is not an Intel-AMD binary-compatible turtle like
>Fedora/RedHat/SUSE/... It's really important to have your glibc
>compiled for your processor. It is essential for performance.

Please, point to the benchmarks that demonstrate this for a postgres application.

Mike Stone
Hi Daniel

On 10-Dec-06, at 8:02 PM, Daniel van Ham Colchete wrote:

> Hi Gene,
>
> in my postgresql.conf, the only non-commented lines are:
> fsync = off

This can, and will result in lost data.

> lc_messages = 'C'
> lc_monetary = 'C'
> lc_numeric = 'C'
> lc_time = 'C'

How much memory does this machine have and what version of postgresql are you using?

> max_connections = 100
> shared_buffers = 5000
> temp_buffers = 1000
> work_mem = 4096
>
> The only two values I changed are shared_buffers and work_mem.

Dave

> *** BUT ***
> I'm using Gentoo Linux, so all my libraries (including glibc, which is
> very important to PostgreSQL) and all my software are compiled with
> good CFLAGS for my processor ("-O2 -march=athlon-xp (...)"). My
> Linux is not an Intel-AMD binary-compatible turtle like
> Fedora/RedHat/SUSE/... It's really important to have your glibc
> compiled for your processor. It is essential for performance.
>
> I can't imagine anyone buying a $1k quad-core XEON and using an
> i586-compatible distro that doesn't even know what the fudge
> SSE/SSE2/vectorized instructions are.
>
> Best regards,
> Daniel Colchete
>
> On 12/10/06, Gene <genekhart@gmail.com> wrote:
>> I have a similar type application, I'm partitioning using constraint
>> exclusion so queries only have to look at a few tables. I've found
>> that there is some overhead to using partitioning so you should test
>> to see how many partitions you want to create. Could I check out your
>> postgresql.conf parameters to compare? thanks
>>
>> Gene Hart
On Dec 11, 2006, at 02:47, Daniel van Ham Colchete wrote:

> I never understood what the difference between the ASCII/ISO-8859-1/UTF8
> charsets means to a database. They're all simple C strings that don't have
> a zero byte in the middle (like UTF16 would) and that don't require any
> different processing unless you are doing case-insensitive search (then
> you would have a problem).

That's not the whole story. UTF-8 and other variable-width encodings don't provide a 1:1 mapping of logical characters to single bytes; in particular, combination characters open the possibility of multiple different byte sequences mapping to the same code point; therefore, string comparison in such encodings generally cannot be done at the byte level (unless, of course, you first ascertain that the strings involved are all normalized to an unambiguous subset of your encoding).

PostgreSQL's use of strings is not limited to string comparison. Substring extraction, concatenation, regular expression matching, up/downcasing, tokenization and so on are all part of PostgreSQL's small library of text manipulation functions, and all deal with logical characters, meaning they must be Unicode-aware.

Alexander.
> That's not the whole story. UTF-8 and other variable-width encodings > don't provide a 1:1 mapping of logical characters to single bytes; in > particular, combination characters opens the possibility of multiple > different byte sequences mapping to the same code point; therefore, > string comparison in such encodings generally cannot be done at the > byte level (unless, of course, you first acertain that the strings > involved are all normalized to an unambiguous subset of your encoding). Can you show me such encodings supported by PostgreSQL other than UTF-8? -- Tatsuo Ishii SRA OSS, Inc. Japan
On Dec 11, 2006, at 04:35, Tatsuo Ishii wrote:

>> That's not the whole story. UTF-8 and other variable-width encodings
>> don't provide a 1:1 mapping of logical characters to single bytes; in
>> particular, combination characters open the possibility of multiple
>> different byte sequences mapping to the same code point; therefore,
>> string comparison in such encodings generally cannot be done at the
>> byte level (unless, of course, you first ascertain that the strings
>> involved are all normalized to an unambiguous subset of your
>> encoding).
>
> Can you tell me such encodings supported by PostgreSQL other
> than UTF-8?

http://www.postgresql.org/docs/8.1/interactive/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

Alexander.
On Dec 11, 2006, at 10:20, Tatsuo Ishii wrote:

> My question was what kind of encoding other than UTF-8 has a
> characteristic such as: "combination characters open the possibility
> of multiple different byte sequences mapping to the same code point"

No idea; perhaps only UTF-8. What I said was that variable-width encodings don't provide a 1:1 mapping of logical characters to single bytes; combination characters are a feature of Unicode.

Alexander.
On 12/11/06, Alexander Staubo <alex@purefiction.net> wrote:
> On Dec 11, 2006, at 02:47, Daniel van Ham Colchete wrote:
>
> > I never understood what the difference between the ASCII/ISO-8859-1/UTF8
> > charsets means to a database. They're all simple C strings that don't have
> > a zero byte in the middle (like UTF16 would) and that don't require any
> > different processing unless you are doing case-insensitive search (then
> > you would have a problem).
>
> That's not the whole story. UTF-8 and other variable-width encodings
> don't provide a 1:1 mapping of logical characters to single bytes; in
> particular, combination characters open the possibility of multiple
> different byte sequences mapping to the same code point; therefore,
> string comparison in such encodings generally cannot be done at the
> byte level (unless, of course, you first ascertain that the strings
> involved are all normalized to an unambiguous subset of your encoding).
>
> PostgreSQL's use of strings is not limited to string comparison.
> Substring extraction, concatenation, regular expression matching,
> up/downcasing, tokenization and so on are all part of PostgreSQL's small
> library of text manipulation functions, and all deal with logical
> characters, meaning they must be Unicode-aware.
>
> Alexander.

You're right. I was thinking only about my own cases, which take Unicode normalization for granted and don't use regexps/tokenization/...

Thanks

Best,
Daniel
Hi Dave,

On 12/11/06, Dave Cramer <pg@fastcrypt.com> wrote:
> Hi Daniel
> On 10-Dec-06, at 8:02 PM, Daniel van Ham Colchete wrote:
>
> > Hi Gene,
> >
> > in my postgresql.conf, the only non-commented lines are:
> > fsync = off
> This can, and will result in lost data.

I know... If there is a power failure things can happen. I know, but the performance difference is really, really big. I just have to decide if I'm willing to take that chance or not.

> > lc_messages = 'C'
> > lc_monetary = 'C'
> > lc_numeric = 'C'
> > lc_time = 'C'
>
> How much memory does this machine have and what version of postgresql
> are you using?

It's only a test server with 512MB RAM. I only used it to see how well PostgreSQL would do in an ugly case.

Daniel
On 11-Dec-06, at 5:36 AM, Daniel van Ham Colchete wrote:

> Hi Dave,
>
> On 12/11/06, Dave Cramer <pg@fastcrypt.com> wrote:
>> Hi Daniel
>> On 10-Dec-06, at 8:02 PM, Daniel van Ham Colchete wrote:
>>
>> > Hi Gene,
>> >
>> > in my postgresql.conf, the only non-commented lines are:
>> > fsync = off
>> This can, and will result in lost data.
>
> I know... If there is a power failure things can happen. I know, but
> the performance difference is really, really big. I just have to decide
> if I'm willing to take that chance or not.
>
>> > lc_messages = 'C'
>> > lc_monetary = 'C'
>> > lc_numeric = 'C'
>> > lc_time = 'C'
>>
>> How much memory does this machine have and what version of postgresql
>> are you using?
> It's only a test server with 512MB RAM. I only used it to see how well
> PostgreSQL would do in an ugly case.

Given that optimal performance for postgresql can require up to 50% of available memory, you are going to leave the OS with 256MB of memory?

Dave

> Daniel
Mike,

unfortunately I don't have any benchmarks right now. Doing something like this on the same machine would be at least two days' work. Installing Gentoo and getting it running well is not a quick task (although it is easy).

But trust me on this one. It's worth it. Think of this: PostgreSQL and GNU libc use a lot of complex algorithms: btrees, hashes, checksums, string functions, etc... And there are a lot of ways to compile them into binary code. The Pentium 4 has vectorization that lets you run plenty of instructions in parallel, but AMD doesn't have this. Intel also has SSE2, which makes double-precision floating-point operations a lot faster; AMD also doesn't have this (at least on 32 bits). Now imagine that you're RedHat and that you have to deliver one CD to both AMD and Intel servers. That means you can't use any AMD-specific or Intel-specific technology at the binary level.

Things can get worse. If you take a look at GCC's code (at gcc/config/i386/i386.c), you'll see that GCC knows what runs faster on each processor. Let me give an example with the FDIV and FSQRT instructions (costs relative to an ADD):

ARCH - FDIV - FSQRT
i386: 88 - 122
i486: 73 - 83
Pentium: 39 - 70
K6: 56 - 56
Athlon: 24 - 35
K8: 19 - 35
Pentium4: 43 - 43
Nocona: 40 - 44

Imagine that you are GCC and that you have two options in front of you: you can use FSQRT, or FDIV plus 20 ADD/SUBs. In a Pentium situation you should use the second option, but on a Pentium 4 or an Athlon you should choose the first one. This was only an example; nowadays you would have to choose between 387, 3dNow, 3dNow+, SSE, SSE2, ... With this info, GCC knows how to choose the best way of doing things for each processor. When you are compiling for a generic i586 (like Fedora and RedHat), you are using the Pentium's timings.

An example that I know of: it's impossible to run my software at a highly demanding customer's site without compiling it for their processor (I make 5 builds on every release). Using Intel's compiler for Intel's processors makes things even faster, but it is not free, and the "how fast" part really depends on how your application is coded.

With 64-bit processors, AMD and Intel restarted the process and everyone has SSE2 (but not SSE3). Even so, the timings are still very different.

Best regards
Daniel

On 12/11/06, Michael Stone <mstone+postgres@mathom.us> wrote:
> On Sun, Dec 10, 2006 at 11:02:44PM -0200, Daniel van Ham Colchete wrote:
> >I'm using Gentoo Linux, so all my libraries (including glibc, which is
> >very important to PostgreSQL) and all my software are compiled with
> >good CFLAGS for my processor ("-O2 -march=athlon-xp (...)"). My
> >Linux is not an Intel-AMD binary-compatible turtle like
> >Fedora/RedHat/SUSE/... It's really important to have your glibc
> >compiled for your processor. It is essential for performance.
>
> Please, point to the benchmarks that demonstrate this for a postgres
> application.
>
> Mike Stone
> >> How much memory does this machine have and what version of postgresql
> >> are you using?
> > It's only a test server with 512MB RAM. I only used it to see how well
> > PostgreSQL would do in an ugly case.
>
> Given that optimal performance for postgresql can require up to 50%
> of available memory, you are going to leave the OS with 256MB of
> memory?

If that were the case I wouldn't have any problem letting the OS use only 256MB, but this is not my production server. My production server isn't built yet. It'll have at least 2GB of memory. But it's good to know anyway.
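To make Dave's "up to 50% of memory" point concrete, a rough, hypothetical postgresql.conf starting point for an 8.1/8.2-era server with 2GB of RAM might look like the lines below. The numbers are invented for illustration (8.1 interprets shared_buffers and effective_cache_size in 8kB pages and work_mem in kB) and would have to be validated against the real workload; they are not figures from this thread.

shared_buffers = 50000         # ~400MB, ten times Daniel's test value of 5000 (~40MB)
work_mem = 16384               # 16MB per sort/hash, per backend - watch total usage
effective_cache_size = 175000  # ~1.4GB; tells the planner how much the OS is caching
checkpoint_segments = 16       # fewer, larger checkpoints for a write-heavy load
fsync = on                     # as Dave notes, fsync = off risks losing the database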
On Mon, Dec 11, 2006 at 09:05:56AM -0200, Daniel van Ham Colchete wrote: > But, trust me on this one. It's worth it. You know what? I don't. > Think of this: PostgreSQL and GNU LibC use a lot of complex algorithms: > btree, hashes, checksums, strings functions, etc... And you have a lot of > ways to compile it into binary code. Now you have Pentium4's vectorization > that allow you to run plenty of instructions in paralell, but AMD doesn't > have this. Intel also have SSE2 that makes double-precision floatpoint > operations a lot faster, AMD also doesn't have this (at least on 32bits). Athlon 64 has SSE2, also in 32-bit-mode. Of course, it doesn't really matter, since at the instant you hit the disk even once, it's going to take a million cycles and any advantage you got from saving single cycles is irrelevant. > Imagine that you are GCC and that you have two options in front of > you: you can use FSQRT or FDIV plus 20 ADD/SUB. Could you please describe a reasonable case where GCC would have such an option? I cannot imagine any. > An example that I know of: it's impossible to run my software at a > high demanding customer without compiling it to the it's processor (I > make 5 compilations on every release). What's "your software"? How can you make such assertions without backing them up? How can you know that the same holds for PostgreSQL? As Mike said, point to the benchmarks showing this "essential" difference between -O2 and -O2 -mcpu=pentium4 (or whatever). The only single worthwhile difference I can think of, is that glibc can use the SYSENTER function if it knows you have a 686 or higher (which includes AMD), and with recent kernels, I'm not even sure if that is needed anymore. /* Steinar */ -- Homepage: http://www.sesse.net/
On Mon, Dec 11, 2006 at 09:05:56AM -0200, Daniel van Ham Colchete wrote: >unfortunally I don't have any benchmarks right now. That's fairly normal for gentoo users pushing their compile options. Mike Stone
On 12/11/06, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> On Mon, Dec 11, 2006 at 09:05:56AM -0200, Daniel van Ham Colchete wrote:
> > But, trust me on this one. It's worth it.
>
> You know what? I don't.

So test it yourself.

> > Think of this: PostgreSQL and GNU LibC use a lot of complex algorithms:
> > btree, hashes, checksums, strings functions, etc... And you have a lot of
> > ways to compile it into binary code. Now you have Pentium4's vectorization
> > that allow you to run plenty of instructions in paralell, but AMD doesn't
> > have this. Intel also have SSE2 that makes double-precision floatpoint
> > operations a lot faster, AMD also doesn't have this (at least on 32bits).
>
> Athlon 64 has SSE2, also in 32-bit-mode.

It's true. But I'm not saying that Postfix is faster on AMD or Intel systems. I'm saying that it's a lot faster when you compile Postfix and your glibc for your processor. AMD also has features that Intel systems don't: 3dNow, for example. The fact is that if your distro is compatible with a plain Athlon, you can use neither SSE nor SSE2.

> Of course, it doesn't really matter, since at the instant you hit the disk
> even once, it's going to take a million cycles and any advantage you got from
> saving single cycles is irrelevant.

Really??? We're talking about high performance systems and every case is different. I once saw a DDR2-RAM-based storage (something like 1TB). Before you say it: I don't understand how it works, but you won't lose your data on a reboot or power failure. It was very expensive, but it really solved the IO bottleneck. And even when your bottleneck is the IO, it still makes no sense to waste CPU resources unnecessarily.

> > Imagine that you are GCC and that you have two options in front of
> > you: you can use FSQRT or FDIV plus 20 ADD/SUB.
>
> Could you please describe a reasonable case where GCC would have such an
> option? I cannot imagine any.

As I said, it is an example. Take floating-point divisions. You have plenty of ways of doing them: 387, MMX, SSE, 3dNow, etc... Here GCC has to make a choice. And this is only one case. Usually, compiler optimizations are really complex and the processor's timings count a lot. At every optimization the compiler needs to measure the quickest path, so it uses information on how the processor will run the code. If you take a look at AMD's docs you will see that their SSE2 implementation is different from Intel's internally. So sometimes the quickest path uses SSE2 and sometimes it doesn't. You also have to count the costs of converting SSE registers to common ones. If you still can't imagine any case, you can read Intel's assembler reference. You'll see that there are a lot of ways of doing a lot of things.

> > An example that I know of: it's impossible to run my software at a
> > high demanding customer without compiling it to the it's processor (I
> > make 5 compilations on every release).
>
> What's "your software"? How can you make such assertions without backing them
> up? How can you know that the same holds for PostgreSQL?
>
> As Mike said, point to the benchmarks showing this "essential" difference
> between -O2 and -O2 -mcpu=pentium4 (or whatever). The only single worthwhile
> difference I can think of, is that glibc can use the SYSENTER function if it
> knows you have a 686 or higher (which includes AMD), and with recent kernels,
> I'm not even sure if that is needed anymore.

Steinar, you should really test it.
I won't read the PostgreSQL source to point out where it could use SSE or SSE2 or whatever, and I won't read glibc's code. You don't need to believe what I'm saying. You can read the GCC docs, Intel's assembler reference, and AMD's docs about their processor and about how different that arch is.

Best regards,
Daniel Colchete
This is very, very, very true :-)!

I just remembered one case with MySQL. When I changed the distro from Conectiva 10 (an rpm-based, now-discontinued Brazilian distro) to Gentoo, a MySQL operation that usually took 2 minutes to run finished in 47 seconds.

This is absolutely vague. I have no way to prove it to you. The old situation doesn't even exist anymore because I used the same hardware for the upgrade. And I can't measure how much each factor helped: compiling glibc and MySQL with good cflags, rebuilding my database in an ordered way, a newer kernel, etc... All I know is that this process still runs in less than 1 minute (my database is larger now). I used the very same hardware: P4 3.0GHz, SATA disk, without RAID. And I only upgraded because Conectiva's support for their version 10 ended and I need to keep my system up to date with security patches.

Best,
Daniel

On 12/11/06, Michael Stone <mstone+postgres@mathom.us> wrote:
> On Mon, Dec 11, 2006 at 09:05:56AM -0200, Daniel van Ham Colchete wrote:
> >unfortunally I don't have any benchmarks right now.
>
> That's fairly normal for gentoo users pushing their compile options.
>
> Mike Stone
Oops! daniel.colchete@gmail.com ("Daniel van Ham Colchete") was seen spray-painting on a wall: > But, trust me on this one. It's worth it. No, the point of performance analysis is that you *can't* trust the people that say "trust me on this one." If you haven't got a benchmark where you can demonstrate a material and repeatable difference, then you're just some Gentoo "speed racer" making things up. -- select 'cbbrowne' || '@' || 'acm.org'; http://linuxdatabases.info/info/wp.html One last point about metaphor, poetry, etc. As an example to illustrate these capabilities in Sastric Sanskrit, consider the "bahuvrihi" construct (literally "man with a lot of rice") which is used currently in linguistics to describe references outside of compounds.
You are right, Christopher.

Okay. Let's solve this matter. What PostgreSQL benchmark software should I use???

I'll test PostgreSQL 8.1 on a Fedora Core 6 and on a Gentoo. I'll get the same version FC6 uses and install it on my Gentoo. I'll use the same hardware (different partitions for each).

I'm not saying that Gentoo is faster than FC6. I just want to prove that if you compile your software to make better use of your processor, it will run faster.

It might take a few days because I'm pretty busy right now at my job.

Best regards,
Daniel

On 12/11/06, Christopher Browne <cbbrowne@acm.org> wrote:
> Oops! daniel.colchete@gmail.com ("Daniel van Ham Colchete") was seen spray-painting on a wall:
> > But, trust me on this one. It's worth it.
>
> No, the point of performance analysis is that you *can't* trust the
> people that say "trust me on this one."
>
> If you haven't got a benchmark where you can demonstrate a material
> and repeatable difference, then you're just some Gentoo "speed racer"
> making things up.
> --
> select 'cbbrowne' || '@' || 'acm.org';
> http://linuxdatabases.info/info/wp.html
> One last point about metaphor, poetry, etc. As an example to
> illustrate these capabilities in Sastric Sanskrit, consider the
> "bahuvrihi" construct (literally "man with a lot of rice") which is
> used currently in linguistics to describe references outside of
> compounds.
On Mon, Dec 11, 2006 at 11:09:13AM -0200, Daniel van Ham Colchete wrote: >> You know what? I don't. > So test it yourself. You're making the claims, you're supposed to be proving them... > As I said, it is an example. Take floatpoint divisions. You have > plenty of ways of doing it: 387, MMX, SSE, 3dNow, etc... Here GCC have > to make a choice. No, you don't. MMX, SSE and 3Dnow! will all give you the wrong result (reduced precision). SSE2, on the other hand, has double precision floats, so you might have a choice there -- except that PostgreSQL doesn't really do a lot of floating-point anyhow. > And this is only one case. Usually, compiler optimizations are really > complex and the processor's timings counts a lot. You keep asserting this, with no good backing. > If you still can't imagine any case, you can read Intel's assembler > reference. You'll see that there are a lot of ways of doing a lot of > things. I've been programming x86 assembler for ten years or so... > Steinar, you should really test it. I won't read the PostgreSQL source > to point you were it could use SSE or SSE2 or whatever. And I won't > read glibc's code. Then you should stop making these sort of wild claims. > You don't need to belive in what I'm saying. You can read GCC docs, > Intel's assembler reference, AMD's docs about their processor and > about how diferent that arch is. I have. /* Steinar */ -- Homepage: http://www.sesse.net/
On 12/11/06, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote: > On Mon, Dec 11, 2006 at 11:09:13AM -0200, Daniel van Ham Colchete wrote: > >> You know what? I don't. > > So test it yourself. > > You're making the claims, you're supposed to be proving them... > > > As I said, it is an example. Take floatpoint divisions. You have > > plenty of ways of doing it: 387, MMX, SSE, 3dNow, etc... Here GCC have > > to make a choice. > > No, you don't. MMX, SSE and 3Dnow! will all give you the wrong result > (reduced precision). SSE2, on the other hand, has double precision floats, so > you might have a choice there -- except that PostgreSQL doesn't really do a > lot of floating-point anyhow. > > > And this is only one case. Usually, compiler optimizations are really > > complex and the processor's timings counts a lot. > > You keep asserting this, with no good backing. > > > If you still can't imagine any case, you can read Intel's assembler > > reference. You'll see that there are a lot of ways of doing a lot of > > things. > > I've been programming x86 assembler for ten years or so... So I'm a newbie to you. I learned x86 assembler last year. > > Steinar, you should really test it. I won't read the PostgreSQL source > > to point you were it could use SSE or SSE2 or whatever. And I won't > > read glibc's code. > > Then you should stop making these sort of wild claims. > > > You don't need to belive in what I'm saying. You can read GCC docs, > > Intel's assembler reference, AMD's docs about their processor and > > about how diferent that arch is. > > I have. > > /* Steinar */ Okay, I'll do the benchmarks. I just sent an e-mail about this to the list. If you have any suggestions on how to make the benchmark, please let me know. I like it when I prove myself wrong. Although it's much better when I'm right :-)... Best regards, Daniel Colchete
On Mon, Dec 11, 2006 at 11:17:06AM -0200, Daniel van Ham Colchete wrote: > I just remebered one case with MySQL. When I changed the distro from > Conectiva 10 (rpm-based ended brazilian distro) to Gentoo, a MySQL > operation that usually took 2 minutes to run, ended in 47 seconds. How do you know that this improvement had _anything_ to do with the use of different optimization flags? Were even the MySQL versions or configuration the same? > This is absolutely vage. Indeed it is. > I don't have how to prove it to you. No, but you should stop making this sort of "absolutely essential" claims if you can't. > And I can't mesure how each factor helped: compiling glibc and Mysql with > good cflags, rebuilding my database in a ordered way, never kernel, etc.. Exactly. So why are you attributing it to the first factor only? And why do you think this would carry over to PostgreSQL? Remember, anecdotal evidence isn't. /* Steinar */ -- Homepage: http://www.sesse.net/
On Mon, Dec 11, 2006 at 11:31:48AM -0200, Daniel van Ham Colchete wrote: > What PostgreSQL benchmark software should I use??? Look up the list archives; search for "TPC". > I'll test PostgreSQL 8.1 on a Fedora Core 6 and on a Gentoo. I'll get > the same version FC6 uses and install it at my Gentoo. I'll use the > same hardware (diferent partitions to each). Why do you want to compare FC6 and Gentoo? Wasn't your point that the -march= was supposed to be the relevant factor here? In that case, you want to keep all other things equal; so use the same distribution, only with -O2 -march=i686 vs. -march=athlon-xp (or whatever). /* Steinar */ -- Homepage: http://www.sesse.net/
On 12/11/06, Daniel van Ham Colchete <daniel.colchete@gmail.com> wrote: > But, trust me on this one. It's worth it. Think of this: PostgreSQL > and GNU LibC use a lot of complex algorithms: btree, hashes, > checksums, strings functions, etc... And you have a lot of ways to > compile it into binary code. Now you have Pentium4's vectorization > that allow you to run plenty of instructions in paralell, but AMD > doesn't have this. Intel also have SSE2 that makes double-precision > floatpoint operations a lot faster, AMD also doesn't have this (at > least on 32bits). Now imagine that you're RedHat and that you have to > deliver one CD to AMD and Intel servers. That means you can't use any > AMD-specific or Intel-specific tecnology at the binary level. AMD processors since the K6-2, and I think Intel ones since the P-Pro, are essentially RISC processors with a hardware microcode compiler that translates and reorganizes instructions on the fly. Instruction choice and ordering were extremely important in older 32-bit architectures (like the 486) but are much less important these days. I think you will find that an optimized glibc might be faster in specific contrived cases, but the whole is unfortunately less than the sum of its parts. While SSE2 might be able to optimize things like video decoding and the like, for most programs it's of little benefit and IMO a waste of time. Also, as others pointed out, things like cache hits/misses and i/o considerations are actually much more important than instruction execution speed. We ran Gentoo here for months and did not find it enough faster to merit the bleeding-edge quirks it has for production environments. If you dig assembly, there was an interesting tackle of the spinlocks code on the hackers list last year IIRC. merlin
daniel.colchete@gmail.com ("Daniel van Ham Colchete") writes: > You are right Christopher. > > Okay. Let's solve this matter. > > What PostgreSQL benchmark software should I use??? pgbench is one option. There's a TPC-W at pgFoundry (<http://pgfoundry.org/projects/tpc-w-php/>). There's the Open Source Database Benchmark. (<http://osdb.sourceforge.net/>) Those are three reasonable options. > I'll test PostgreSQL 8.1 on a Fedora Core 6 and on a Gentoo. I'll get > the same version FC6 uses and install it at my Gentoo. I'll use the > same hardware (diferent partitions to each). Wrong approach. You'll be comparing apples to oranges, because Gentoo and Fedora pluck sources from different points in the source code stream. In order to prove what you want to prove, you need to run the benchmarks all on Gentoo, where you run with 4 categorizations: 1. Where you run PostgreSQL and GLIBC without any processor-specific optimizations 2. Where you run PostgreSQL and GLIBC with all relevant processor-specific optimizations 3. Where you run PostgreSQL with, and GLIBC without processor-specific optimizations 4. Where you run PostgreSQL without, and GLIBC with processor-specific optimizations That would allow one to clearly distinguish which optimizations are particularly relevant. > I'm not saying that Gentoo is faster than FC6. I just want to prove > that if you compile your software to make better use of your > processor, it will run faster. > > It might take a few days because I'm pretty busy right now at my > job. I expect that you'll discover, if you actually do these tests, that this belief is fairly much nonsense. - Modern CPUs do a huge amount of on-CPU self-tuning. - CPU features that could have a material effect tend to be unusable when compiling general purpose libraries and applications. GCC doesn't generate MMX-like instructions. - Database application performance tends to be I/O driven. - When database application performance *isn't* I/O driven, it is likely to be driven by cache management, which compiler options won't affect. -- output = reverse("ofni.secnanifxunil" "@" "enworbbc") http://linuxdatabases.info/info/sgml.html "very few people approach me in real life and insist on proving they are drooling idiots." -- Erik Naggum, comp.lang.lisp
This definitely is the correct approach. Actually, Daniel van Ham Colchete may not be as "all wet" as some around here think. We've had previous data that shows that pg can become CPU bound (see previous posts by Josh Berkus and others regarding CPU overhead in what should be IO bound tasks). In addition, we know that x86 compatible 64b implementations differ enough between AMD and Intel products that it sometimes shows on benches. Evidence outside the DBMS arena supports the hypothesis that recent CPUs are needing more hand-holding and product specific compiling, not less, compared to their previous versions. Side Note: I wonder what if anything pg could gain from using SWAR instructions (SSE*, MMX, etc)? I'd say the fairest attitude is to do everything we can to support having the proper experiments done w/o presuming the results. Ron Peacetree At 10:47 AM 12/11/2006, Chris Browne wrote: >In order to prove what you want to prove, you need to run the >benchmarks all on Gentoo, where you run with 4 categorizations: > > 1. Where you run PostgreSQL and GLIBC without any processor-specific > optimizations > > 2. Where you run PostgreSQL and GLIBC with all relevant > processor-specific optimizations > > 3. Where you run PostgreSQL with, and GLIBC without > processor-specific optimizations > > 4. Where you run PostgreSQL without, and GLIBC with processor-specific > optimizations > >That would allow one to clearly distinguish which optimizations are >particularly relevant. > > > I'm not saying that Gentoo is faster than FC6. I just want to prove > > that if you compile your software to make better use of your > > processor, it will run faster. > > > > It might take a few days because I'm pretty busy right now at my > > job. > >I expect that you'll discover, if you actually do these tests, that >this belief is fairly much nonsense. > >- Modern CPUs do a huge amount of on-CPU self-tuning. > >- CPU features that could have a material effect tend to be unusable > when compiling general purpose libraries and applications. GCC > doesn't generate MMX-like instructions. > >- Database application performance tends to be I/O driven. > >- When database application performance *isn't* I/O driven, it is > likely to be driven by cache management, which compiler options > won't affect.
On Mon, Dec 11, 2006 at 12:15:51PM -0500, Ron wrote: >I'd say the fairest attitude is to do everything we can to support >having the proper experiments done w/o presuming the results. Who's presuming results?[1] It is fair to say that making extraordinary claims without any evidence should be discouraged. It's also fair to say that if there are specific things that need cpu-specific tuning they'll be fairly limited critical areas (e.g., locks) which would probably be better implemented with a hand-tuned code and runtime cpu detection than by magical mystical compiler invocations. Mike Stone [1] I will say that I have never seen a realistic benchmark of general code where the compiler flags made a statistically significant difference in the runtime. There are some particularly cpu-intensive codes, like some science simulations or encoding routines where they matter, but that's not the norm--and many of those algorithms already have hand-tuned versions which will outperform autogenerated code. You'd think that with all the talk that the users of certain OS's generate about CFLAG settings, there'd be some well-published numbers backing up the hype. At any rate if there were numbers to back the claim then I think they could certainly be considered without prejudice.
Statements like these can not be reasonably interpreted in any manner _except_ that of presuming the results: "I expect that you'll discover, if you actually do these tests, that this belief (that using arch specific compiler options lead to better performing SW) is fairly much nonsense." "...IMO a waste of time..." etc The correct objective response to claims w/o evidence is to request evidence, and to do everything we can to support it being properly gathered. Not to try to discourage the claimant from even trying by ganging up on them with multiple instances of Argument From Authority or variations of Ad Hominem attacks. (The validity of the claim has nothing to do with the skills or experience of the claimant or anyone else in the discussion. Only on the evidence.) It is a tad unfair and prejudicial to call claims that CPU optimizations matter to the performance of DB product "extraordinary". Evidence outside the DBMS field exists; and previous posts here show that pg can indeed become CPU-bound during what should be IO bound tasks. At the moment, Daniel's claims are not well supported. That is far different from being "extraordinary" given the current circumstantial evidence. Let's also bear in mind that as a community project, we can use all the help we can get. Driving potential resources away is in opposition to that goal. [1] The evidence that arch specific flags matter to performance can be found as easily as recompiling your kernel or your compiler. While it certainly could be argued how "general purpose" such SW is, the same could be said for just about any SW at some level of abstraction. Ron Peacetree At 12:31 PM 12/11/2006, Michael Stone wrote: >On Mon, Dec 11, 2006 at 12:15:51PM -0500, Ron wrote: >>I'd say the fairest attitude is to do everything we can to support >>having the proper experiments done w/o presuming the results. > >Who's presuming results?[1] It is fair to say that making >extraordinary claims without any evidence should be discouraged. >It's also fair to say that if there are specific things that need >cpu-specific tuning they'll be fairly limited critical areas (e.g., >locks) which would probably be better implemented with a hand-tuned >code and runtime cpu detection than by magical mystical compiler invocations. > >Mike Stone > >[1] I will say that I have never seen a realistic benchmark of >general code where the compiler flags made a statistically >significant difference in the runtime. There are some particularly >cpu-intensive codes, like some science simulations or encoding >routines where they matter, but that's not the norm--and many of >those algorithms already have hand-tuned versions which will >outperform autogenerated code. You'd think that with all the talk >that the users of certain OS's generate about CFLAG settings, >there'd be some well-published numbers backing up the hype. At any >rate if there were numbers to back the claim then I think they could >certainly be considered without prejudice.
Michael, On 12/11/06 9:31 AM, "Michael Stone" <mstone+postgres@mathom.us> wrote: > [1] I will say that I have never seen a realistic benchmark of general > code where the compiler flags made a statistically significant > difference in the runtime. Here's one - I wrote a general purpose Computational Fluid Dynamics analysis method used by hundreds of people to perform aircraft and propulsion systems analysis. Compiler flag tuning would speed it up by factors of 2-3 or even more on some architectures. The reason it was so effective is that the structure of the code was designed to be general, but also to expose the critical performance sections in a way that the compilers could use - deep pipelining/vectorization, unrolling, etc, were carefully made easy for the compilers to exploit in critical sections. Yes, this made the code in those sections harder to read, but it was a common practice because it might take weeks of runtime to get an answer and performance mattered. The problem I see with general purpose DBMS code the way it's structured in pgsql (and others) is that many of the critical performance sections are embedded in abstract interfaces that obscure them from optimization. An example: a simple "is equal to" operation has many layers surrounding it to ensure that UDFs can be declared and that special comparison semantics can be accommodated. But if you're simply performing a large number of INT vs. INT comparisons, it will be thousands of times slower than a CPU-native operation because of the function call overhead, etc. I've seen presentations that show IPC of Postgres at about 0.5, versus the 2-4 possible from the CPU. Column databases like C-Store remove these abstractions at planner time to expose native operations in large chunks to the compiler and the IPC reflects that - typically 1+ and as high as 2.5. If we were to redesign the executor and planner to emulate that same structure we could achieve similar speedups and the compiler would matter more. - Luke
On Mon, Dec 11, 2006 at 01:20:50PM -0500, Ron wrote: >(The validity of the claim has nothing to do with the skills or >experience of the claimant or anyone else in the discussion. Only on >the evidence.) Please go back and reread the original post. I don't think the response was unwarranted. > It is a tad unfair and prejudicial to call claims that CPU >optimizations matter to the performance of DB product "extraordinary". >Evidence outside the DBMS field exists; and previous posts here show >that pg can indeed become CPU-bound during what should be IO bound tasks. >At the moment, Daniel's claims are not well supported. That is far >different from being "extraordinary" given the current circumstantial >evidence. No, they're extraordinary regardless of whether postgres is CPU bound. The question is whether cpu-specific compiler flags will have a significant impact--which is, historically, fairly unlikely. Far more likely is that performance can be improved with either a non-cpu-specific optimization (e.g., loop unrolling vs not) or with an algorithmic enhancement. More importantly, you're arguing *your own* point, not the original claim. I'll refresh your memory: "My Linux is not an Intel-AMD binary compatible turtle like Fedora/RedHat/SUSE/... It's really important to have your GLIBC compiled for your processor. It is essential for performance." You wanna draw the line between that (IMO, extraordinary) claim and the rational argument that you're trying to substitute in its place? >[1] The evidence that arch specific flags matter to performance can >be found as easily as recompiling your kernel or your >compiler. Then, please, point to the body of evidence. IME, the results of such efforts aren't statistically all that significant on most workloads. I'm sure there are edge cases, but it's certainly not going to be on my top ten things to look at when tuning a database system. (If your kernel's cpu utilization is the bottleneck in your database, you've probably got bigger problems than compiler flags can solve.) Where you get the real big benefits in a (linux) kernel recompile is when you select code that's specifically tuned for a particular processor--not from arch-specific gcc flags--and those sorts of things are increasingly moving toward boot-time autotuning rather than compile-time manual tuning for obvious reasons. Mike Stone
On Mon, Dec 11, 2006 at 10:30:55AM -0800, Luke Lonergan wrote: >Here's one - I wrote a general purpose Computational Fluid Dynamics analysis >method used by hundreds of people to perform aircraft and propulsion systems >analysis. That's kinda the opposite of what I meant by general code. I was trying (perhaps poorly) to distinguish between scientific codes and other stuff (especially I/O or human interface code). >Compiler flag tuning would speed it up by factors of 2-3 or even >more on some architectures. The reason it was so effective is that the >structure of the code was designed to be general, but also to expose the >critical performance sections in a way that the compilers could use - deep >pipelining/vectorization, unrolling, etc, were carefully made easy for the >compilers to exploit in critical sections. It also sounds like code specifically written to take advantage of compiler techniques, rather than random code thrown at a pile of cflags. I don't disagree that it is possible to get performance improvements if code is written to be performant code; I do (and did) disagree with the idea that you'll get huge performance improvements by taking regular old C application code and playing with compiler flags. >Yes, this made the code in those >sections harder to read, but it was a common practice because it might >take weeks of runtime to get an answer and performance mattered. IMO that's appropriate for some science codes (although I think even that sector is beginning to find that they've gone too far in a lot of ways), but for a database I'd rather have people debugging clean, readable code than risking my data to something incomprehensible that runs in optimal time. >Column databases like C-Store remove these abstractions at planner time to >expose native operations in large chunks to the compiler and the IPC >reflects that - typically 1+ and as high as 2.5. If we were to redesign the >executor and planner to emulate that same structure we could achieve similar >speedups and the compiler would matter more. gcc --make-it-really-fast-by-rewriting-it-from-the-ground-up? Mike Stone
On 12/11/06, Ron <rjpeace@earthlink.net> wrote: > Statements like these can not be reasonably interpreted in any manner > _except_ that of presuming the results: > > "I expect that you'll discover, if you actually do these tests, that > this belief (that using arch specific compiler options lead to > better performing SW) is fairly much nonsense." > > "...IMO a waste of time..." > > etc > > The correct objective response to claims w/o evidence is to request > evidence, and to do everything we can to support it being properly > gathered. Not to try to discourage the claimant from even trying by > ganging up on them with multiple instances of Argument From Authority > or variations of Ad Hominem attacks. > (The validity of the claim has nothing to do with the skills or > experience of the claimant or anyone else in the discussion. Only on > the evidence.) /shrugs, this is not debate class, I just happened to have barked up this particular tree before, and decided to share my insights from it. A lot of the misunderstanding here stems from legacy perceptions about how cpus work, not to mention the entire architecture. If somebody produces hard facts to the contrary, great, and I encourage them to do so. also, some people posting here, not necessarily me, are authority figures. :-) merlin
At 01:47 PM 12/11/2006, Michael Stone wrote: >On Mon, Dec 11, 2006 at 01:20:50PM -0500, Ron wrote: >>(The validity of the claim has nothing to do with the skills or >>experience of the claimant or anyone else in the discussion. Only >>on the evidence.) > >Please go back and reread the original post. I don't think the >response was unwarranted. So he's evidently young and perhaps a trifle over-enthusiastic. We were once too. ;-) We are not going to get valuable contributions nor help people become more valuable to the community by "flaming them into submission". ...and who knows, a =properly= done experiment may provide both surprises and unexpected insights/benefits. I agree completely with telling him he needs to get better evidence and even with helping him understand how he should go about getting it. It should be noted that his opposition has not yet done these experiments either. (Else they could simply point to the results that refute Daniel's hypothesis.) The reality is that a new CPU architecture and multiple new memory technologies are part of this discussion. I certainly do not expect them to change the fundamental thinking regarding how to get the best performance for a DBMS. OTOH, there are multiple valid reasons to give such new stuff a thorough and rigorous experimental shake-down. ATM, =both= sides of this debate are lacking evidence for their POV. Let's support getting definitive evidence. No matter who brings it to the table ;-) Ron Peacetree
On 12/11/06, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> On Mon, Dec 11, 2006 at 11:17:06AM -0200, Daniel van Ham Colchete wrote:
> > I just remembered one case with MySQL. When I changed the distro from
> > Conectiva 10 (an rpm-based, discontinued Brazilian distro) to Gentoo, a MySQL
> > operation that usually took 2 minutes to run finished in 47 seconds.
>
> How do you know that this improvement had _anything_ to do with the use of
> different optimization flags? Were even the MySQL versions or configuration
> the same?
>
> > This is absolutely vague.
>
> Indeed it is. Finally we agreed on something.
>
> > I don't have how to prove it to you.
>
> No, but you should stop making this sort of "absolutely essential" claims if
> you can't.
>
> > And I can't measure how each factor helped: compiling glibc and MySQL with
> > good cflags, rebuilding my database in an ordered way, newer kernel, etc..
>
> Exactly. So why are you attributing it to the first factor only? And why do
> you think this would carry over to PostgreSQL?
>
> Remember, anecdotal evidence isn't.
But that's exactly what I said. I'm not attributing this case to the optimization factor alone. As I said, there are a lot of factors involved; the MySQL version only changed by a minor upgrade (from 4.1.15 to 4.1.21). Steinar, I say I'll do the benchmark and that will be the end of the story.
> /* Steinar */
At 02:28 PM 12/11/2006, Merlin Moncure wrote:
>also, some people posting here, not necessarily me, are authority figures. :-)
>
>merlin
Noam Chomsky is one of the most influential thinkers Linguistics has yet seen. He has been proven wrong a number of times, even within Linguistics. There are plenty of other historical examples. As others have said, opinion without evidence and logic is just that: opinion. And even Expert Opinion has been known to be wrong, sometimes very much so. Part of what makes an expert an expert is that they can back up their statements with evidence and logic that are compelling even to the non-expert when asked to do so.
All I'm saying is let's all remember how "assume" is spelled and support the getting of some hard data.
Ron Peacetree
Michael, On 12/11/06 10:57 AM, "Michael Stone" <mstone+postgres@mathom.us> wrote: > That's kinda the opposite of what I meant by general code. I was trying > (perhaps poorly) to distinguish between scientific codes and other > stuff (especially I/O or human interface code). Yes - choice of language has often been a differentiator in these markets - LISP versus FORTRAN, C++ versus SQL/DBMS. This isn't just about science, it's also in Business Intelligence - e.g. Special purpose datamining code versus algorithms expressed inside a data management engine. > It also sounds like code specifically written to take advantage of > compiler techniques, rather than random code thrown at a pile of cflags. > I don't disagree that it is possible to get performance improvements if > code is written to be performant code; I do (and did) disagree with the > idea that you'll get huge performance improvements by taking regular old > C application code and playing with compiler flags. Agreed - that's my point exactly. > IMO that's appropriate for some science codes (although I think even > that sector is beginning to find that they've gone too far in a lot of > ways), but for a database I'd rather have people debugging clean, readable > code than risking my data to something incomprehensible that runs in > optimal time. Certainly something of a compromise is needed. >> Column databases like C-Store remove these abstractions at planner time to > > gcc --make-it-really-fast-by-rewriting-it-from-the-ground-up? Maybe not from ground->up, but rather from about 10,000 ft -> 25,000 ft? There are some who have done a lot of work studying the impact of more efficient DBMS, see here: http://homepages.cwi.nl/~boncz/x100.html - Luke
On 12/11/06, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> On Mon, Dec 11, 2006 at 11:31:48AM -0200, Daniel van Ham Colchete wrote:
> > What PostgreSQL benchmark software should I use???
>
> Look up the list archives; search for "TPC".
>
> > I'll test PostgreSQL 8.1 on a Fedora Core 6 and on a Gentoo. I'll get
> > the same version FC6 uses and install it at my Gentoo. I'll use the
> > same hardware (different partitions for each).
>
> Why do you want to compare FC6 and Gentoo? Wasn't your point that the -march=
> was supposed to be the relevant factor here? In that case, you want to keep
> all other things equal; so use the same distribution, only with -O2
> -march=i686 vs. -march=athlon-xp (or whatever).
>
> /* Steinar */
Using Gentoo is just an easy way to apply the cflag optimizations to all the other libs as well: glibc, ...

I can also measure performance on Gentoo with a cflag-optimized PostgreSQL and a plain PostgreSQL.
On Mon, Dec 11, 2006 at 02:51:09PM -0500, Ron wrote: >Let's support getting definitive evidence. Since nobody opposed the concept of contrary evidence, I don't suppose you're fighting an uphill battle on that particular point. It's fine to get preachy about supporting intellectual curiosity, but do remember that it's a waste of everyone's (limited) time to give equal time to all theories. If someone comes to you with an idea for a perpetual motion machine your effort is probably better spent on something other than helping him build it, regardless of whether that somehow seems unfair. (Now if he brings a working model, that's a different story...) Heck, even building a bunch of non-working perpetual motion machines as a demonstration is a waste of time, because it's always easy to say "well, if you had just...". That's precisely why the person bringing the extraordinary claim is also expected to bring the proof, rather than expecting that everyone else prove the status quo. Mike Stone
On 12/11/06, Chris Browne <cbbrowne@acm.org> wrote:
> daniel.colchete@gmail.com ("Daniel van Ham Colchete") writes:
> > You are right Christopher.
> >
> > Okay. Let's solve this matter.
> >
> > What PostgreSQL benchmark software should I use???
>
> pgbench is one option.
>
> There's a TPC-W at pgFoundry
> (<http://pgfoundry.org/projects/tpc-w-php/>).
>
> There's the Open Source Database Benchmark.
> (<http://osdb.sourceforge.net/>)
>
> Those are three reasonable options.
Thanks Chris, I'm going to take a look at those options.
>
> > I'll test PostgreSQL 8.1 on a Fedora Core 6 and on a Gentoo. I'll get
> > the same version FC6 uses and install it at my Gentoo. I'll use the
> > same hardware (different partitions for each).
>
> Wrong approach. You'll be comparing apples to oranges, because Gentoo
> and Fedora pluck sources from different points in the source code
> stream.
>
> In order to prove what you want to prove, you need to run the
> benchmarks all on Gentoo, where you run with 4 categorizations:
>
> 1. Where you run PostgreSQL and GLIBC without any processor-specific
>    optimizations
>
> 2. Where you run PostgreSQL and GLIBC with all relevant
>    processor-specific optimizations
>
> 3. Where you run PostgreSQL with, and GLIBC without,
>    processor-specific optimizations
>
> 4. Where you run PostgreSQL without, and GLIBC with,
>    processor-specific optimizations
>
> That would allow one to clearly distinguish which optimizations are
> particularly relevant.
Good idea too, and it is much easier to do as well.
Ron wrote:
> We are not going to get valuable contributions nor help people become
> more valuable to the community by "flaming them into submission".
>
> Let's support getting definitive evidence. No matter who brings it to
> the table ;-)
Thanks, Ron, for a voice of respect and reason. Since I first started using Usenet back in 1984, inexplicable rudeness has been a plague on otherwise civilized people. We're a community, and we're all in this to help one another. Sometimes we give good advice, and sometimes even those "wearing the mantle of authority" can make boneheaded comments. I know I do, and when it happens, I always appreciate it when I'm taken to task with good humor and tolerance. When someone comes to this forum with an idea you disagree with, no matter how brash or absurd their claims, it's so easy to challenge them with grace and good humor, rather than chastising them with harsh words and driving them from our community. If you're right, you will have taught a valuable lesson to someone. And if on occasion a newcomer shows us something new, then we've all learned. Either way, we have a new friend and contributor to the community.
Craig
Daniel van Ham Colchete wrote:
> Using Gentoo is just an easy way to apply the cflag optimizations to all
> the other libs as well: glibc, ...
>
> I can also measure performance on Gentoo with a cflag-optimized
> PostgreSQL and a plain PostgreSQL.
I can certainly recall that when I switched from Fedora Core (2 or 3, can't recall now) to Gentoo that the machine was "faster" for many activities. Of course I can't recall precisely what now :-(.
To actually track down *why* and *what* made it faster is another story, and custom CFLAGS is only 1 of the possible factors; others could be:
- different kernel versions (Gentoo would have possibly been later)
- different kernel patches (both RedHat and Gentoo patch 'em)
- different versions of glibc (Gentoo possibly later again)
- different config options for glibc (not sure if they in fact are, but it's possible...)
- kernel and glibc built with different versions of gcc (again I suspect Gentoo may have used a later version)
So there are a lot of variables to consider if you want to settle this debate once and for all :-)!
Best wishes
Mark
Hi yall,

I made some preliminary tests.

Before the results, I would like to make some acknowledgments:
1 - I didn't show any proof for the things I said until now.
2 - It really is a waste of everyone's time to say one thing when I can't prove it.

But all I said is the knowledge I have been accumulating over the past few years working on a project where optimization is important. After algorithmic optimizations, compiler options are the second item on my list, and with my software they show a measurable improvement. The other software I use seems to run faster too, but I didn't measure it.

TEST PROCEDURE
================
I ran this test on a Gentoo test machine I have here. It's a Pentium 4 3.0GHz (I don't know which P4) with 1 GB of RAM. It only uses SATA drives. I didn't change my glibc (or any other lib) during the test. I used GCC 3.4.6.

I ran each test three times, so we can get an idea of the average values and standard deviation.

Each time I ran the test with the following commands:
dropdb mydb
createdb mydb
pgbench -i -s 10 mydb 2> /dev/null
psql -c 'vacuum analyze' mydb
psql -c 'checkpoint' mydb
sync
pgbench -v -n -t 600 -c 5 mydb

My postgresql.conf was the default one, except for:
fsync = <depends on the test>
shared_buffers = 10000
work_mem = 10240

Every test's output should begin with the following, but I removed it because it's always the same:
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 5
number of transactions per client: 600
number of transactions actually processed: 3000/3000

TESTS RESULTS
==============
TEST 01: CFLAGS="-O2 -march=i686" fsync=false

tps = 734.948620 (including connections establishing)
tps = 736.866642 (excluding connections establishing)

tps = 713.225047 (including connections establishing)
tps = 715.039059 (excluding connections establishing)

tps = 721.769259 (including connections establishing)
tps = 723.631065 (excluding connections establishing)

TEST 02: CFLAGS="-O2 -march=i686" fsync=true

tps = 75.466058 (including connections establishing)
tps = 75.485675 (excluding connections establishing)

tps = 75.115797 (including connections establishing)
tps = 75.135311 (excluding connections establishing)

tps = 73.883113 (including connections establishing)
tps = 73.901997 (excluding connections establishing)

TEST 03: CFLAGS="-O2 -march=pentium4" fsync=false

tps = 846.337784 (including connections establishing)
tps = 849.067017 (excluding connections establishing)

tps = 829.476269 (including connections establishing)
tps = 832.008129 (excluding connections establishing)

tps = 831.416457 (including connections establishing)
tps = 835.300001 (excluding connections establishing)

TEST 04: CFLAGS="-O2 -march=pentium4" fsync=true

tps = 83.224016 (including connections establishing)
tps = 83.248157 (excluding connections establishing)

tps = 80.811892 (including connections establishing)
tps = 80.834525 (excluding connections establishing)

tps = 80.671406 (including connections establishing)
tps = 80.693975 (excluding connections establishing)

CONCLUSIONS
Everyone can draw their own conclusions. Mine are:

1 - You do get an improvement when you compile PostgreSQL using processor-specific technologies. With fsync on you get an improvement of 9% in the tps rate. With fsync off, the improvement is 15.6%.

2 - You can still improve your indexes, SQL and everything else; this only adds another possible improvement.

3 - I can't prove this, but I *think* this is related to the fact that GCC knows how to do the same thing better on each processor.

4 - I'm still using source-based distros.

WHAT NOW
There are other things I wish to test:
1 - What effect does an optimized glibc have on PostgreSQL?
2 - How much improvement can I get playing with my postgresql.conf?
3 - What effect do the optimizations have on concurrency?
4 - What if I used the Intel C++ Compiler instead of GCC?
5 - What if I use GCC 4.1.1 instead of GCC 3.4.6?

I'm thinking about writing a script to run all the tests (more than 3 times each), collect the data and plot some graphs. I don't have the time right now to do it; maybe next week I will.

I invite everyone to comment on/suggest about the procedure or the results.

Best regards,
Daniel Colchete
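P.S.: the wrapper I have in mind would be something along these lines -- just an untested sketch, with the database name, run count and pgbench options as placeholders to adjust:

#!/bin/sh
# Repeat the full pgbench procedure several times and collect the tps lines.
DB=mydb        # placeholder database name
RUNS=5         # how many repetitions per configuration
OUT=results.txt

for i in `seq 1 $RUNS`; do
    dropdb $DB
    createdb $DB
    pgbench -i -s 10 $DB 2> /dev/null
    psql -c 'vacuum analyze' $DB
    psql -c 'checkpoint' $DB
    sync
    # keep only the tps lines, tagged with the run number
    pgbench -v -n -t 600 -c 5 $DB | grep '^tps' | sed "s/^/run $i: /" >> $OUT
done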
Daniel,

Good stuff. Can you try this with just "-O3" versus "-O2"?

- Luke

On 12/11/06 2:22 PM, "Daniel van Ham Colchete" <daniel.colchete@gmail.com> wrote:
> Hi yall,
>
> I made some preliminary tests.
> [full test procedure and results snipped]
On Mon, Dec 11, 2006 at 08:22:42PM -0200, Daniel van Ham Colchete wrote:
>TEST 01: CFLAGS="-O2 -march=i686" fsync=false
>
>tps = 734.948620 (including connections establishing)
>tps = 736.866642 (excluding connections establishing)
[snip]
>TEST 03: CFLAGS="-O2 -march=pentium4" fsync=false
>
>tps = 846.337784 (including connections establishing)
>tps = 849.067017 (excluding connections establishing)

Can anyone else reproduce these results? I'm on similar hardware (2.5GHz P4, 1.5G RAM) and my test results are more like this:

(postgresql 8.2.0)

CFLAGS=(default)

tps = 527.300454 (including connections establishing)
tps = 528.898671 (excluding connections establishing)

tps = 517.874347 (including connections establishing)
tps = 519.404970 (excluding connections establishing)

tps = 534.934905 (including connections establishing)
tps = 536.562150 (excluding connections establishing)

CFLAGS=686

tps = 525.179375 (including connections establishing)
tps = 526.801278 (excluding connections establishing)

tps = 557.821136 (including connections establishing)
tps = 559.602414 (excluding connections establishing)

tps = 532.142941 (including connections establishing)
tps = 533.740209 (excluding connections establishing)

CFLAGS=pentium4

tps = 518.869825 (including connections establishing)
tps = 520.394341 (excluding connections establishing)

tps = 537.759982 (including connections establishing)
tps = 539.402547 (excluding connections establishing)

tps = 538.522198 (including connections establishing)
tps = 540.200458 (excluding connections establishing)

Mike Stone
After a long battle with technology, mstone+postgres@mathom.us (Michael Stone), an earthling, wrote: > [1] I will say that I have never seen a realistic benchmark of > general code where the compiler flags made a statistically > significant difference in the runtime. When we were initially trying out PostgreSQL on AIX, I did some (limited, admittedly) comparisons between behaviour when compiled using GCC 3.something, VisualAge C, and VisualAge C++. I did some modifications of -O values; I didn't find differences amounting to more than a percent or two between any of the combinations. If there's to be a difference, anywhere, it ought to have figured pretty prominently between a pretty elderly GCC version and IBM's top of the line PowerPC compiler. -- output = reverse("gro.mca" "@" "enworbbc") http://cbbrowne.com/info/linuxdistributions.html "I doubt this language difference would confuse anybody unless you were providing instructions on the insertion of a caffeine enema." -- On alt.coffee
On Mon, 11 Dec 2006, Michael Stone wrote:
> Can anyone else reproduce these results? I'm on similar hardware (2.5GHz P4,
> 1.5G RAM)...
There are two likely candidates for why Daniel's P4 3.0GHz significantly outperforms your 2.5GHz system.

1) Most 2.5GHz P4 processors use a 533MHz front-side bus (FSB); most 3.0GHz ones use an 800MHz bus.

2) A typical motherboard paired with a 2.5GHz era processor will have a single-channel memory interface; a typical 3.0GHz era board supports dual-channel DDR.

These changes could easily explain the magnitude of difference in results you're seeing, especially when combined with a 20% greater raw CPU clock.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes: > On Mon, 11 Dec 2006, Michael Stone wrote: >> Can anyone else reproduce these results? I'm on similar hardware (2.5GHz P4, >> 1.5G RAM)... > There are two likely candidates for why Daniel's P4 3.0GHz significantly > outperforms your 2.5GHz system. Um, you entirely missed the point: the hardware speedups you mention are quite independent of any compiler options. The numbers we are looking at are the relative speeds of two different compiles on the same hardware, not whether hardware A is faster than hardware B. regards, tom lane
Mike,

are you using "-mtune/-mcpu" or "-march" with GCC? Which GCC version? Are you working with a 32-bit OS or 64-bit?

Daniel

On 12/11/06, Michael Stone <mstone+postgres@mathom.us> wrote:
> Can anyone else reproduce these results? I'm on similar hardware (2.5GHz
> P4, 1.5G RAM) and my test results are more like this:
> [results snipped]
On 12.12.2006, at 02:37, Michael Stone wrote:
> Can anyone else reproduce these results? I'm on similar hardware
> (2.5GHz P4, 1.5G RAM) and my test results are more like this:
I'm on totally different hardware / software (MacBook Pro 2.33GHz C2D) and I can't reproduce the results. I have played with a lot of settings in the CFLAGS, including -march and -O3 and -O2 - there is no significant difference in the tests. With fsync=off I get around 2100 tps on average with all the different settings I have tested. I tried to get the rest of the setup as similar to the one described by Daniel as possible. It might be that the crappy Pentium 4 needs some special handling, but I can't get the Core 2 Duo in my laptop to produce different tps numbers with the different optimizations.
Btw: best results were 2147 with -march=i686 and 2137 with -march=nocona. Both with -O3.
cug
On Dec 11, 2006, at 23:22 , Daniel van Ham Colchete wrote:
> I ran this test on a Gentoo test machine I have here. It's a Pentium 4
> 3.0GHz (I don't know which P4)
Try cat /proc/cpuinfo.
> TESTS RESULTS
> ==============
On a dual-core Opteron 280 with 4G RAM with an LSI PCI-X Fusion-MPT SAS controller, I am getting wildly uneven results:

tps = 264.775137 (excluding connections establishing)
tps = 160.365754 (excluding connections establishing)
tps = 151.967193 (excluding connections establishing)
tps = 148.010349 (excluding connections establishing)
tps = 260.973569 (excluding connections establishing)
tps = 144.693287 (excluding connections establishing)
tps = 148.147036 (excluding connections establishing)
tps = 259.485717 (excluding connections establishing)

I suspect the real maximum performance of the system is ~150 tps, but that the LSI's write cache is buffering the writes. I would love to validate this hypothesis, but I'm not sure how.

Alexander.
On Tue, Dec 12, 2006 at 01:35:04AM -0500, Greg Smith wrote: >These changes could easily explain the magnitude of difference in results >you're seeing, expecially when combined with a 20% greater raw CPU clock. I'm not interested in comparing the numbers between the systems (which is obviously pointless); I am intested in the fact that there was a consistent difference among the numbers on his system (based on difference optimizations) and no difference among mine. Mike Stone
On Tue, Dec 12, 2006 at 07:10:34AM -0200, Daniel van Ham Colchete wrote: >are you using "-mtune/-mcpu" or "-march" with GCC? I used exactly the options you said you used. >Witch GCC version? Are you working with a 32bits OS or 64bits? 3.3.5; 32 Mike Stone
On Tue, Dec 12, 2006 at 12:29:29PM +0100, Alexander Staubo wrote: >I suspect the hardware's real maximum performance of the system is >~150 tps, but that the LSI's write cache is buffering the writes. I >would love to validate this hypothesis, but I'm not sure how. With fsync off? The write cache shouldn't really matter in that case. (And for this purpose that's probably a reasonable configuration.) Mike Stone
Luke Lonergan wrote:
> Can you try this with just "-O3" versus "-O2"?
Thanks to Daniel for doing these tests. I happen to have done the same tests about 3/4 years ago, and concluded that gcc flags did *not* influence performance. Moved by curiosity, I revamped those tests now on a test machine (single P4 @ 3.2 GHz, with 2Mb cache and 512 Mb RAM). Here are the results:

http://www.streppone.it/cosimo/work/pg/gcc.png

In short: tests executed with postgresql 8.2.0, gcc version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4), tps figures computed as the average of 9 pgbench runs (don't ask why 9... :-), with exactly the same commands given by Daniel:

"-O0"                                ~  957 tps
"-O1 -mcpu=pentium4 -mtune=pentium4" ~ 1186 tps
"-O2 -mcpu=pentium4 -mtune=pentium4" ~ 1229 tps
"-O3 -mcpu=pentium4 -mtune=pentium4" ~ 1257 tps
"-O6 -mcpu=pentium4 -mtune=pentium4" ~ 1254 tps

I'm curious now to get the same tests run with a custom-cflags-compiled glibc.

--
Cosimo
On Tue, Dec 12, 2006 at 01:42:06PM +0100, Cosimo Streppone wrote: > "-O0" ~ 957 tps > "-O1 -mcpu=pentium4 -mtune=pentium4" ~ 1186 tps > "-O2 -mcpu=pentium4 -mtune=pentium4" ~ 1229 tps > "-O3 -mcpu=pentium4 -mtune=pentium4" ~ 1257 tps > "-O6 -mcpu=pentium4 -mtune=pentium4" ~ 1254 tps For the record, -O3 = -O6 for regular gcc. It used to matter for pgcc, but that is hardly in use anymore. /* Steinar */ -- Homepage: http://www.sesse.net/
On Tue, Dec 12, 2006 at 01:42:06PM +0100, Cosimo Streppone wrote: >"-O0" ~ 957 tps >"-O1 -mcpu=pentium4 -mtune=pentium4" ~ 1186 tps >"-O2 -mcpu=pentium4 -mtune=pentium4" ~ 1229 tps >"-O3 -mcpu=pentium4 -mtune=pentium4" ~ 1257 tps >"-O6 -mcpu=pentium4 -mtune=pentium4" ~ 1254 tps > >I'm curious now to get the same tests run with >a custom-cflags-compiled glibc. I'd be curious to see -O2 with and without the arch-specific flags, since that's mostly what the discussion is about. Mike Stone
On Tue, Dec 12, 2006 at 07:48:06AM -0500, Michael Stone wrote: >I'd be curious to see -O2 with and without the arch-specific flags, >since that's mostly what the discussion is about. That came across more harshly than I intended; I apologize for that. It's certainly a useful data point to compare the various optimization levels. I'm rerunning the tests with gcc 4.1.2 and another platform to see if that makes any difference. Mike Stone
On Mon, 2006-12-11 at 20:22 -0200, Daniel van Ham Colchete wrote: > > I'm thinking about writing a script to make all the tests (more than 3 > times each), get the data and plot some graphs. > > I don't have the time right now to do it, maybe next week I'll have. Check out the OSDL test suite stuff. It runs the test and handles all the reporting for you (including graphs). http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/ > I invite everyone to comment/sugest on the procedure or the results. I'd recommend running each test for quite a bit longer. Get your buffers dirty and exercised, and see what things look like then. Also, if you have the disk, offload your wal files to a separate disk. -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
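For reference, the usual way to put the WAL on its own disk in 8.1 is to relocate pg_xlog and leave a symlink behind while the server is stopped -- a sketch, where $PGDATA and the /mnt/disk2 mount point are placeholders for your own paths:

# stop the server so pg_xlog is quiescent
pg_ctl -D $PGDATA stop
# move the WAL directory to the second disk and symlink it back
mv $PGDATA/pg_xlog /mnt/disk2/pg_xlog
ln -s /mnt/disk2/pg_xlog $PGDATA/pg_xlog
pg_ctl -D $PGDATA start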
On Dec 12, 2006, at 13:32 , Michael Stone wrote: > On Tue, Dec 12, 2006 at 12:29:29PM +0100, Alexander Staubo wrote: >> I suspect the hardware's real maximum performance of the system >> is ~150 tps, but that the LSI's write cache is buffering the >> writes. I would love to validate this hypothesis, but I'm not >> sure how. > > With fsync off? The write cache shouldn't really matter in that > case. (And for this purpose that's probably a reasonable > configuration.) No, fsync=on. The tps values are similarly unstable with fsync=off, though -- I'm seeing bursts of high tps values followed by low-tps valleys, a kind of staccato flow indicative of a write caching being filled up and flushed. Alexander.
1= In all these results I'm seeing, no one has yet reported what their physical IO subsystem is... ...when we are benching a DB.

2= So far we've got ~ a factor of 4 performance difference between Michael Stone's 1S 1C Netburst era 2.5GHz P4 PC and Guido Neitzer's 1S 2C MacBook Pro 2.33GHz C2D. If the physical IO subsystems are even close to equivalent across the systems benched so far, we've clearly established that pg performance is more sensitive to factors outside the physical IO subsystem than might usually be thought with regard to a DBMS. (At least for this benchmark SW.)

3= Daniel van Ham Colchete is running Gentoo. That means every SW component on his box has been compiled to be optimized for the HW it is running on. There may be a combination of effects going on for him that others not running a system optimized from the ground up for its HW do not see.

4= If we are testing arch-specific compiler options and only arch-specific compiler options, we should remove the OS as a variable. Since Daniel has presented evidence in support of his hypothesis, the first step should be to duplicate his environment as =exactly= as possible and see if someone can independently reproduce the results when the only significant difference is the human involved. This will guard against procedural error in the experiment.

Possible Outcomes
A= Daniel made a procedural error. We all learn what it is and how to avoid it.
B= The Gentoo results are confirmed but no other OS shows this effect. Much digging ensues ;-)
C= Daniel's results are confirmed as platform independent once we take all factors into account properly. We all learn more re: how to best set up pg for highest performance.

Ron Peacetree

At 01:35 AM 12/12/2006, Greg Smith wrote:
>On Mon, 11 Dec 2006, Michael Stone wrote:
>
>>Can anyone else reproduce these results? I'm on similar hardware
>>(2.5GHz P4, 1.5G RAM)...
>
>There are two likely candidates for why Daniel's P4 3.0GHz
>significantly outperforms your 2.5GHz system.
>[rest snipped]
* Cosimo Streppone:
> "-O0"                                ~  957 tps
> "-O1 -mcpu=pentium4 -mtune=pentium4" ~ 1186 tps
> "-O2 -mcpu=pentium4 -mtune=pentium4" ~ 1229 tps
> "-O3 -mcpu=pentium4 -mtune=pentium4" ~ 1257 tps
> "-O6 -mcpu=pentium4 -mtune=pentium4" ~ 1254 tps
-mcpu and -mtune are synonymous. You really should use -march here (but the result is non-generic code). Keep in mind that GCC does not contain an instruction scheduler for the Pentium 4s. I also believe that the GCC switches are not fine-grained enough to cover the various Pentium 4 variants. For instance, some chips don't like the CMOV instruction at all, but others can process it with decent speed.
--
Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
In response to Ron <rjpeace@earthlink.net>:
> 3= Daniel van Ham Colchete is running Gentoo. That means every SW
> component on his box has been compiled to be optimized for the HW it
> is running on.
> There may be a combination of effects going on for him that others
> not running a system optimized from the ground up for its HW do not see.
http://www.potentialtech.com/wmoran/source.php
You get an idea of how old these tests are by the fact that the latest and greatest was FreeBSD 4.9 at the time, but I suppose it may still be relevant.
--
Bill Moran
Collaborative Fusion Inc.
Alexander Staubo <alex@purefiction.net> writes: > No, fsync=on. The tps values are similarly unstable with fsync=off, > though -- I'm seeing bursts of high tps values followed by low-tps > valleys, a kind of staccato flow indicative of a write caching being > filled up and flushed. It's notoriously hard to get repeatable numbers out of pgbench :-( A couple of tips: * don't put any faith in short runs. I usually use -t 1000 plus -c whatever. * make sure you loaded the database (pgbench -i) with a scale factor (-s) at least equal to the maximum -c you want to test. Otherwise you're mostly measuring update contention. * pay attention to when checkpoints occur. You probably need to increase checkpoint_segments if you want pgbench not to be checkpoint-bound. regards, tom lane
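Putting those tips together, a run would look something like this (the numbers are only illustrative):

# scale factor at least as large as the highest client count you plan to test
pgbench -i -s 10 mydb
# longer runs, with -c no larger than the -s used at initialization
pgbench -t 1000 -c 10 mydb

and in postgresql.conf, something like:

checkpoint_segments = 10    # up from the default of 3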
On 12/12/06, Florian Weimer <fweimer@bfk.de> wrote:
> * Cosimo Streppone:
>
> > "-O0"                                ~  957 tps
> > "-O1 -mcpu=pentium4 -mtune=pentium4" ~ 1186 tps
> > "-O2 -mcpu=pentium4 -mtune=pentium4" ~ 1229 tps
> > "-O3 -mcpu=pentium4 -mtune=pentium4" ~ 1257 tps
> > "-O6 -mcpu=pentium4 -mtune=pentium4" ~ 1254 tps
>
> -mcpu and -mtune are synonymous. You really should use -march here (but
> the result is non-generic code). Keep in mind that GCC does not
> contain an instruction scheduler for the Pentium 4s. I also believe
> that the GCC switches are not fine-grained enough to cover the various
> Pentium 4 variants. For instance, some chips don't like the CMOV
> instruction at all, but others can process it with decent speed.
You can use -march=pentium4, -march=prescott and -march=nocona for the different Pentium 4 processors. But you have to use -march (and not -mcpu or -mtune), because without it you are still using only i386 instructions.

Daniel
Alexander Staubo wrote: > No, fsync=on. The tps values are similarly unstable with fsync=off, > though -- I'm seeing bursts of high tps values followed by low-tps > valleys, a kind of staccato flow indicative of a write caching being > filled up and flushed. Databases with checkpointing typically exhibit this cyclical throughput syndrome. (put another way : this is to be expected and you are correct that it indicates buffered data being flushed to disk periodically).
On Tue, 12 Dec 2006, Tom Lane wrote: > Um, you entirely missed the point: the hardware speedups you mention are > quite independent of any compiler options. The numbers we are looking > at are the relative speeds of two different compiles on the same > hardware, not whether hardware A is faster than hardware B. The point that I failed to make clear is that expecting Mike's system to perform like Daniel's just because they have similar processors isn't realistic, considering the changes that happened in the underlying hardware during that period. Having very different memory subsystems will shift which optimizations are useful and which have minimal impact even if the processor is basically the same. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Tom Lane wrote: > Alexander Staubo <alex@purefiction.net> writes: > > No, fsync=on. The tps values are similarly unstable with fsync=off, > > though -- I'm seeing bursts of high tps values followed by low-tps > > valleys, a kind of staccato flow indicative of a write caching being > > filled up and flushed. > > It's notoriously hard to get repeatable numbers out of pgbench :-( > > A couple of tips: > * don't put any faith in short runs. I usually use -t 1000 > plus -c whatever. > * make sure you loaded the database (pgbench -i) with a scale > factor (-s) at least equal to the maximum -c you want to test. > Otherwise you're mostly measuring update contention. > * pay attention to when checkpoints occur. You probably need > to increase checkpoint_segments if you want pgbench not to be > checkpoint-bound. While skimming over the pgbench source it has looked to me like it's necessary to pass the -s switch (scale factor) to both the initialization (-i) and the subsequent (non -i) runs. I'm not sure if this is obvious from the documentation but I thought it may be useful to mention.
On Tue, 12 Dec 2006, Alvaro Herrera wrote: > While skimming over the pgbench source it has looked to me like it's > necessary to pass the -s switch (scale factor) to both the > initialization (-i) and the subsequent (non -i) runs. For non-custom runs, it's computed based on the number of branches. Around line 1415 you should find: res = PQexec(con, "select count(*) from branches"); ... scale = atoi(PQgetvalue(res, 0, 0)); So it shouldn't be required during the run, just the initialization. However, note that there were some recent bug fixes to the scaling implementation, and I would recommend using the version that comes with 8.2 (pgbench 1.58 2006/10/21). It may compile fine even if you copy that pgbench.c into an older version's contrib directory; it's certainly a drop-in replacement (and improvement) for the pgbench 1.45 that comes with current Postgres 8.1 versions. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Mike,

I'm making some other tests here on other hardware (also Gentoo). I found out that PostgreSQL stops for a while if I change the -t parameter on pgbench from 600 to 1000, and I get ~150 tps instead of ~950 tps. I don't know why PostgreSQL stopped, but it was longer than 5 seconds, and my disk IO was consuming 100% of my CPU time during this period. And I'm testing with my fsync turned off.

Maybe if you lower your -t rate you are going to see this improvement.

Best regards,
Daniel

On 12/12/06, Michael Stone <mstone+postgres@mathom.us> wrote:
> On Tue, Dec 12, 2006 at 07:10:34AM -0200, Daniel van Ham Colchete wrote:
> >are you using "-mtune/-mcpu" or "-march" with GCC?
>
> I used exactly the options you said you used.
>
> >Which GCC version? Are you working with a 32-bit OS or 64-bit?
>
> 3.3.5; 32
>
> Mike Stone
Alvaro Herrera <alvherre@commandprompt.com> writes: > While skimming over the pgbench source it has looked to me like it's > necessary to pass the -s switch (scale factor) to both the > initialization (-i) and the subsequent (non -i) runs. No, it's not supposed to be, and I've never found it needed in practice. The code seems able to pull the scale out of the database (I forget how it figures it out exactly). regards, tom lane
"Daniel van Ham Colchete" <daniel.colchete@gmail.com> writes: > I'm making some other tests here at another hardware (also Gentoo). I > found out that PostgreSQL stops for a while if I change the -t > parameter on bgbench from 600 to 1000 and I have ~150 tps instead of > ~950tps. > I don't know why PostgreSQL stoped, but it was longer than 5 seconds > and my disk IO was comsuming 100% of my CPU time during this period. Checkpoint? regards, tom lane
At 10:47 AM 12/12/2006, Tom Lane wrote:
>It's notoriously hard to get repeatable numbers out of pgbench :-(
That's not a good characteristic in benchmarking SW... Does the OSDL stuff have an easier time getting reproducible results?
>A couple of tips:
> * don't put any faith in short runs. I usually use -t 1000
> plus -c whatever.
> * make sure you loaded the database (pgbench -i) with a
> scale factor (-s) at least equal to the maximum -c you want to test.
> Otherwise you're mostly measuring update contention.
> * pay attention to when checkpoints occur. You probably
> need to increase checkpoint_segments if you want pgbench not to be
> checkpoint-bound.
This all looks very useful. Can you give some guidance as to what checkpoint_segments should be increased to? Do the values you are running pgbench with suggest what value checkpoint_segments should be?

Ron Peacetree
I just made another test with a second Gentoo machine:

Pentium 4 3.0GHz Prescott
GCC 4.1.1
Glibc 2.4
PostgreSQL 8.1.5
Kernel 2.6.17

Same postgresql.conf as yesterday's.

First test
==========
GLIBC: -O2 -march=i686
PostgreSQL: -O2 -march=i686
Results: 974.638731 975.602142 975.882051 969.142503 992.914167 983.467131 983.231575 994.901330 970.375221 978.377467
Average (error): 980 tps (13 tps)

Second test
===========
GLIBC: -O2 -march=i686
PostgreSQL: -O2 -march=prescott
Results: 988.319643 976.152973 1006.482553 992.431322 983.090838 992.674065 989.216746 990.897615 987.129802 975.907955
Average (error): 988 tps (15 tps)

Third test
==========
GLIBC: -O2 -march=prescott
PostgreSQL: -O2 -march=i686
Results: 969.085400 966.187309 994.882325 968.715150 956.766771 970.151542 960.090571 967.680628 986.568462 991.756520
Average (error): 973 tps (19 tps)

Fourth test
===========
GLIBC: -O2 -march=prescott
PostgreSQL: -O2 -march=prescott
Results: 980.888371 978.128269 969.344669 978.021509 979.256603 993.236457 984.078399 981.654834 976.295925 969.796277
Average (error): 979 tps (11 tps)

The results showed no significant change. The conclusion from today's test would be that there is no improvement in PostgreSQL when using -march=prescott.

I only see 3 differences between yesterday's server and today's: the kernel version (y: 2.6.18, t: 2.6.17), today's server uses an IDE hard drive (yesterday's was SATA), and the gcc version (3.4.6 -> 4.1.1).

I don't know why yesterday we had an improvement and today we did not.

Best
Daniel

On 12/12/06, Daniel van Ham Colchete <daniel.colchete@gmail.com> wrote:
> I'm making some other tests here on other hardware (also Gentoo). I
> found out that PostgreSQL stops for a while if I change the -t
> parameter on pgbench from 600 to 1000, and I get ~150 tps instead of
> ~950 tps.
At 01:35 PM 12/12/2006, Daniel van Ham Colchete wrote:
>I just made another test with a second Gentoo machine:
><snip>
>
>The results showed no significant change. The conclusion from today's
>test would be that there is no improvement in PostgreSQL when using
>-march=prescott.
>
>I only see 3 differences between yesterday's server and today's: the
>kernel version (y: 2.6.18, t: 2.6.17), today's server uses an IDE
>hard drive (yesterday's was SATA), and the gcc version (3.4.6 -> 4.1.1).
>
>I don't know why yesterday we had an improvement and today we did not.
SATA HDs, particularly SATA II HDs and _especially_ 10Krpm 150GB SATA II Raptors, are going to have far better performance than older IDE HDs. Do some raw bonnie++ benches on the two systems. If the numbers from bonnie++ are close to those obtained during the pgbench runs, then the HDs are limiting pgbench. Best would be to use the exact same HD IO subsystem on both boxes, but that may not be feasible.

In general, it would be helpful if the entire config, HW + OS + pg stuff, was documented when submitting benchmark results. (For instance, it would not be outside the realm of plausibility for Guido's C2D laptop to be HD IO limited and for Michael's 2.5 GHz P4 PC to be CPU limited during pgbench runs.)

Ron Peacetree
> I just made another test with a second Gentoo machine: > > Pentium 4 3.0Ghz Prescott > GCC 4.1.1 > Glibc 2.4 > PostgreSQL 8.1.5 > Kernel 2.6.17 > > Same postgresql.conf as yesterday's. > > First test > ========== > GLIBC: -O2 -march=i686 > PostgreSQL: -O2 -march=i686 > Results: 974.638731 975.602142 975.882051 969.142503 992.914167 > 983.467131 983.231575 994.901330 970.375221 978.377467 > Average (error): 980 tps (13 tps) Do you have any scripts for the above process you could share with the list? I have a few machines (a woodcrest and an older Xeon) that I could run this on... Thanks, Bucky
On Tue, 12 Dec 2006, Daniel van Ham Colchete wrote: > I'm making some other tests here at another hardware (also Gentoo). I > found out that PostgreSQL stops for a while if I change the -t > parameter on bgbench from 600 to 1000 and I have ~150 tps instead of > ~950tps. Sure sounds like a checkpoint to me; the ones pgbench generates really aren't fun to watch when running against IDE drives. I've seen my test system with 2 IDE drives pause for 15 seconds straight to process one when fsync is on, caching was disabled on the WAL disk, and the shared_buffer cache is large. If you were processing 600 transactions/client without hitting a checkpoint but 1000 is, try editing your configuration file, double checkpoint_segments, restart the server, and then try again. This is cheating but will prove the source of the problem. This kind of behavior is what other list members were trying to suggest to you before: once you get disk I/O involved, that drives the performance characteristics of so many database operations that small improvements in CPU optimization are lost. Running the regular pgbench code is so wrapped in disk writes that it's practically a worst-case for what you're trying to show. I would suggest that you run all your optimization tests with the -S parameter to pgbench that limits it to select statements. That will let you benchmark whether the core code is benefitting from the CPU improvements without having disk I/O as the main driver of performance. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
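Concretely, that would mean something along these lines (the values are only illustrative and should be adjusted to the box being tested):

# in postgresql.conf: double checkpoint_segments to push checkpoints further apart
checkpoint_segments = 6

# select-only pgbench run, so disk writes stay out of the picture
pgbench -S -v -n -t 5000 -c 5 mydb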
alex@purefiction.net (Alexander Staubo) wrote: > On Dec 12, 2006, at 13:32 , Michael Stone wrote: > >> On Tue, Dec 12, 2006 at 12:29:29PM +0100, Alexander Staubo wrote: >>> I suspect the hardware's real maximum performance of the system is >>> ~150 tps, but that the LSI's write cache is buffering the >>> writes. I would love to validate this hypothesis, but I'm not >>> sure how. >> >> With fsync off? The write cache shouldn't really matter in that >> case. (And for this purpose that's probably a reasonable >> configuration.) > > No, fsync=on. The tps values are similarly unstable with fsync=off, > though -- I'm seeing bursts of high tps values followed by low-tps > valleys, a kind of staccato flow indicative of a write caching being > filled up and flushed. If that seems coincidental with checkpoint flushing, that would be one of the notable causes of that sort of phenomenon. You could get more readily comparable numbers either by: a) Increasing the frequency of checkpoint flushes, so they would be individually smaller, or b) Decreasing the frequency so you could exclude it from the time of the test. -- output = ("cbbrowne" "@" "gmail.com") http://cbbrowne.com/info/slony.html "It can be shown that for any nutty theory, beyond-the-fringe political view or strange religion there exists a proponent on the Net. The proof is left as an exercise for your kill-file." -- Bertil Jonell
Hi,

Did someone try '-mfpmath=sse -msse3'? It would be interesting to know if -mfpmath=sse boosts the performance. I guess the difference in the generated code isn't that much between i686 and prescott; the bigger step is i386 to i686. '-mfpmath=sse -msse3' will also use the SSE unit, which the classic i686 doesn't have.

CFLAGS=-O2 -march=prescott -mfpmath=sse -msse3

Best regards
Sven.

Daniel van Ham Colchete schrieb:
> I just made another test with a second Gentoo machine:
>
> Pentium 4 3.0GHz Prescott
> GCC 4.1.1
> Glibc 2.4
> PostgreSQL 8.1.5
> Kernel 2.6.17
> [tests and results snipped]
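In case it helps, those flags can be passed straight to a from-source PostgreSQL build -- an untested sketch, with the install prefix as a placeholder:

CFLAGS="-O2 -march=prescott -mfpmath=sse -msse3" ./configure --prefix=/usr/local/pgsql
make && make install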
Michael Stone wrote:
> On Tue, Dec 12, 2006 at 01:42:06PM +0100, Cosimo Streppone wrote:
>> "-O0"                                ~  957 tps
>> "-O1 -mcpu=pentium4 -mtune=pentium4" ~ 1186 tps
>> "-O2 -mcpu=pentium4 -mtune=pentium4" ~ 1229 tps
>> "-O3 -mcpu=pentium4 -mtune=pentium4" ~ 1257 tps
>> "-O6 -mcpu=pentium4 -mtune=pentium4" ~ 1254 tps
>>
>> I'm curious now to get the same tests run with
>> a custom-cflags-compiled glibc.
>
> I'd be curious to see -O2 with and without the arch-specific flags,
> since that's mostly what the discussion is about.
I ran the same tests only for:
1) '-O2'
2) '-O2 -march=pentium4 -mtune=pentium4 -mcpu=pentium4'
(so no more doubts here, and thanks for the gcc hints :-)
and I obtained respectively an average of 1238 (plain -O2) vs. 1229 tps over 9 runs. The disk subsystem is a standard desktop SATA, no more than that.

I also tried recompiling *only* pgbench with various options, but as I expected (and hoped) nothing changed.

Interesting, eh?

--
Cosimo
At 11:11 AM 12/13/2006, Cosimo Streppone wrote:
>Interesting, eh?
>
>Cosimo
What I find interesting is that so far Guido's C2D Mac laptop has gotten the highest values by far in this set of experiments, and no one else is even close. The slowest results, Michael's, are on the system with what appears to be the slowest CPU of the bunch; and the ranking of the rest of the results seems to similarly depend on relative CPU performance. This is not what one would naively expect when benching an IO-intensive app like a DBMS.

Given that the typical laptop usually has 1 HD, and a relatively modest one at that (the fastest available are SATA 7200rpm or Seagate's perpendicular-recording 5400rpm) in terms of performance, this feels very much like other factors are bottlenecking the experiments to the point where Daniel's results regarding compiler options are not actually being tested.

Anyone got a 2.33 GHz C2D box with a decent HD IO subsystem more representative of a typical DB server hooked up to it?

Again, the best way to confirm/deny Daniel's results is to duplicate the environment he obtained those results with as closely as possible (preferably exactly) and then have someone else try to duplicate his results.

Also, I think the warnings regarding proper configuration of pgbench and which version of pgbench to use are worthy of note. Do we have guidance yet as to what checkpoint_segments should be set to? Should we be considering using something other than pgbench for such experiments?

Ron Peacetree
> Anyone got a 2.33 GHz C2D box with a decent HD IO subsystem more
> representative of a typical DB server hooked up to it?
I've only seen pgbench numbers > 2,000 tps on either really large hardware (none of the above mentioned comes close) or where the results are in memory due to a small database size (aka measuring update contention).

Just a guess, but these tests (compiler opts.) seem like they sometimes show a benefit where the database is mostly in RAM (which I'd guess many people have), since that would put more work on the CPU/memory subsystems. Other people on the list hinted at this, but I share their hypothesis that once you get IO involved as a bottleneck (which is a more typical DB situation) you won't notice compiler options.

I've got a 2-socket x 2-core Woodcrest PowerEdge 2950 with a BBC 6-disk RAID I'll run some tests on as soon as I get a chance.

I'm also thinking for this test there's no need to tweak the default config other than maybe checkpoint_segments, since I don't really want postgres using large amounts of RAM (all that does is require me to build a larger test DB).

Thoughts?

Thanks,
Bucky
At 01:49 PM 12/13/2006, Bucky Jordan wrote: >I've only seen pg_bench numbers > 2,000 tps on either really large >hardware (none of the above mentioned comes close) or the results >are in memory due to a small database size (aka measuring update contention). Which makes a laptop achieving such numbers all the more interesting IMHO. >Just a guess, but these tests (compiler opts.) seem like they >sometimes show a benefit where the database is mostly in RAM (which >I'd guess many people have) since that would cause more work to be >put on the CPU/Memory subsystems. The cases where the working set, or the performance critical part of the working set, of the DB is RAM resident are very important ones ITRW. >Other people on the list hinted at this, but I share their >hypothesis that once you get IO involved as a bottleneck (which is a >more typical DB situation) you won't notice compiler options. Certainly makes intuitive sense. OTOH, this list has seen discussion of what should be IO bound operations being CPU bound. Evidently due to the expense of processing pg datastructures. Only objective benches are going to tell us where the various limitations on pg performance really are. >I've got a 2 socket x 2 core woodcrest poweredge 2950 with a BBC 6 >disk RAID I'll run some tests on as soon as I get a chance. > >I'm also thinking for this test, there's no need to tweak the >default config other than maybe checkpoint_segments, since I don't >really want postgres using large amounts of RAM (all that does is >require me to build a larger test DB). Daniel's orginal system had 512MB RAM. This suggests to me that tests involving 256MB of pg memory should be plenty big enough. >Thoughts? Hope they are useful. Ron Peacetree
On 13.12.2006, at 19:03, Ron wrote: > What I find interesting is that so far Guido's C2D Mac laptop has > gotten the highest values by far in this set of experiments, and no > one else is even close. This might be the case because I have tested with fsync=off, as my internal hard drive would be a limiting factor and the results wouldn't be really helpful. Perhaps it's still the IO system, I don't know. I can try to reproduce the tests as closely as possible again. Perhaps I had different settings on something, but I doubt that. The new Core * CPUs from Intel are extremely fast with PostgreSQL. > Anyone got a 2.33 GHz C2D box with a decent HD IO subsystem more > representative of a typical DB server hooked up to it? I also now have an Xserve with two dual-core Xeons and two SAS drives (15k Seagates) in a mirrored RAID here. Will do some testing tomorrow. Btw: I always compare only to my own results to have something comparable - same test, same scripts, same db version, same operating system and so on. The rest is just pure interest. cug
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > While skimming over the pgbench source it has looked to me like it's > > necessary to pass the -s switch (scale factor) to both the > > initialization (-i) and the subsequent (non -i) runs. > > No, it's not supposed to be, and I've never found it needed in practice. > The code seems able to pull the scale out of the database (I forget how > it figures it out exactly). pgbench is designed to be a general benchmark, meaning it exercises all parts of the system. I am thinking just re-executing a single SELECT over and over again would be a better test of the CPU optimizations. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
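For reference, a minimal sketch of such a single-statement CPU test using pgbench's custom-script mode (this assumes the -f switch in the contrib pgbench of this era, the pre-8.4 table name "accounts", a scale factor 1 database, and "bench" as a placeholder database name):

-- select_only.sql: one indexed lookup per "transaction", no writes
\setrandom aid 1 100000
SELECT abalance FROM accounts WHERE aid = :aid;

# run against an already initialized pgbench database;
# -n skips the vacuum step, -t is transactions per client
pgbench -n -f select_only.sql -c 1 -t 100000 bench

With -c raised to the number of cores, a run like this is also a reasonable check of whether CPU-targeted compiler flags change anything at all, since almost no I/O is involved.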
[offtopic]; hmm, quite a long thread. Below are stats of the posting:

Total Messages: 87
Total Participants: 27
-------------------------------------------------
19 Daniel van Ham Colchete
12 Michael Stone
 9 Ron
 5 Steinar H. Gunderson
 5 Alexander Staubo
 4 Tom Lane
 4 Greg Smith
 3 Luke Lonergan
 3 Christopher Browne
 2 Merlin Moncure
 2 Guido Neitzer
 2 Dave Cramer
 2 Cosimo Streppone
 2 Bucky Jordan
 1 Tatsuo Ishii
 1 Sven Geisler
 1 Shane Ambler
 1 Michael Glaesemann
 1 Mark Kirkwood
 1 Gene
 1 Florian Weimer
 1 David Boreham
 1 Craig A. James
 1 Chris Browne
 1 Brad Nicholson
 1 Bill Moran
 1 Alvaro Herrera
-------------------------------------------
Bruce, > pgbench is designed to be a general benchmark, meanining it exercises > all parts of the system. I am thinking just reexecuting a single SELECT > over and over again would be a better test of the CPU optimizations. Mostly, though, pgbench just gives the I/O system a workout. It's not a really good general workload. -- Josh Berkus PostgreSQL @ Sun San Francisco
On Wed, 2006-12-13 at 18:36 -0800, Josh Berkus wrote: > Bruce, > > > pgbench is designed to be a general benchmark, meanining it exercises > > all parts of the system. I am thinking just reexecuting a single SELECT > > over and over again would be a better test of the CPU optimizations. > > Mostly, though, pgbench just gives the I/O system a workout. It's not a > really good general workload. It also will not utilize all cpus on a many cpu machine. We recently found that the only way to *really* test with pgbench was to actually run 4+ copies of pgbench at the same time. J > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
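In case anyone wants to reproduce that approach, a rough sketch of the "several copies at once" idea (database name, client counts, and transaction counts are only placeholders):

#!/bin/sh
# start four pgbench runs in parallel against the same cluster,
# then add up the per-copy tps figures afterwards
for i in 1 2 3 4; do
    pgbench -c 10 -t 1000 bench > pgbench.$i.out 2>&1 &
done
wait                      # block until all four copies finish
grep "tps = " pgbench.*.out

Running the copies from separate client machines would also take the pgbench driver itself out of the picture, which matters at high transaction rates.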
"Joshua D. Drake" <jd@commandprompt.com> writes: > On Wed, 2006-12-13 at 18:36 -0800, Josh Berkus wrote: >> Mostly, though, pgbench just gives the I/O system a workout. It's not a >> really good general workload. > It also will not utilize all cpus on a many cpu machine. We recently > found that the only way to *really* test with pgbench was to actually > run 4+ copies of pgbench at the same time. The pgbench app itself becomes the bottleneck at high transaction rates. Awhile back I rewrote it to improve its ability to issue commands concurrently, but then desisted from submitting the changes --- if we change the app like that, future numbers would be incomparable to past ones, which sort of defeats the purpose of a benchmark no? regards, tom lane
On Dec 14, 2006, at 14:44 , Tom Lane wrote: > The pgbench app itself becomes the bottleneck at high transaction > rates. Awhile back I rewrote it to improve its ability to issue > commands concurrently, but then desisted from submitting the > changes --- if we change the app like that, future numbers would > be incomparable to past ones, which sort of defeats the purpose of a > benchmark no? At the same time, if the current pgbench isn't the tool we want to use, is this kind of backward comparison going to hinder any move to improve it? It sounds like there's quite a bit of room for improvement in pg_bench, and in my opinion we should move forward to make an improved tool, one that measures what we want to measure. And while comparison with past results might not be possible, there remains the possibility of rerunning the improved pgbench on previous systems, I should think. Michael Glaesemann grzm seespotcode net
Benchmarks, like any other SW, need modernizing and updating from time to time. Given the multi-core CPU approach to higher performance as the current fad in CPU architecture, we need a benchmark that is appropriate. If SPEC feels it is appropriate to rev their benchmark suite regularly, we probably should as well. Ron Peacetree At 12:44 AM 12/14/2006, Tom Lane wrote: >"Joshua D. Drake" <jd@commandprompt.com> writes: > > On Wed, 2006-12-13 at 18:36 -0800, Josh Berkus wrote: > >> Mostly, though, pgbench just gives the I/O system a workout. It's not a > >> really good general workload. > > > It also will not utilize all cpus on a many cpu machine. We recently > > found that the only way to *really* test with pgbench was to actually > > run 4+ copies of pgbench at the same time. > >The pgbench app itself becomes the bottleneck at high transaction >rates. Awhile back I rewrote it to improve its ability to issue >commands concurrently, but then desisted from submitting the >changes --- if we change the app like that, future numbers would >be incomparable to past ones, which sort of defeats the purpose of a >benchmark no?
(Re)-Design it to do both, unless there's reason to believe that doing one after the other would skew the results. Then old results are available, new results are also visible and useful for future comparisons. And seeing them side by side might be an interesting exercise as well, at least for a while. (sorry for top-posting -- web based interface that doesn't do proper quoting) Greg Williamson DBA GlobeXplorer LLC
On Wed, 13 Dec 2006, Ron wrote: > The slowest results, Michael's, are on the system with what appears to be the > slowest CPU of the bunch; and the ranking of the rest of the results seem to > similarly depend on relative CPU performance. This is not what one would > naively expect when benching a IO intensive app like a DBMS. pgbench with 3000 total transactions and fsync off is barely doing I/O to disk; it's writing a bunch of data to the filesystem cache and ending the benchmark before the data even makes it to the hard drive. This is why his results become completely different as soon as the number of transactions increases. With little or no actual disk writes, you should expect results to be ranked by CPU speed. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
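To put a rough recipe behind that point: a run that actually exercises the disk has to be sized so the data set is bigger than the filesystem cache, has to keep fsync on, and has to run long enough for checkpoints and real writes to happen. Something along these lines (scale and counts are only placeholders; scale 100 builds roughly 10 million accounts rows, on the order of 1.5GB on disk):

# build a data set that does not fit comfortably in cache
pgbench -i -s 100 bench
# leave fsync = on in postgresql.conf, then run long enough that
# checkpoints and real writes happen while the clock is running
pgbench -c 10 -t 10000 bench

With fsync off and only a few thousand transactions, the same command is effectively a CPU and memory test, which is exactly the difference being described above.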
On Thu, 14 Dec 2006, Tom Lane wrote: > The pgbench app itself becomes the bottleneck at high transaction rates. > Awhile back I rewrote it to improve its ability to issue commands > concurrently, but then desisted from submitting the changes --- if we > change the app like that, future numbers would be incomparable to past > ones, which sort of defeats the purpose of a benchmark no? Not at all. Here's an example from the PC hardware benchmarking landscape. Futuremark Corporation has a very popular benchmark for 3D hardware called 3DMark. Every year, they release a new version, and numbers from it are completely different from those produced by the previous year's version. That lets them rev the entire approach taken by the benchmark to reflect current practice. So when the standard for high-end hardware includes, say, acceleration of lighting effects, the new version will include a lighting test. In order to break 1000 points (say) on that test, you absolutely have to have lighting acceleration, even though on the previous year's test you could score that high without it. That is not an isolated example; every useful PC benchmark gets updated regularly, completely breaking backward compatibility, to reflect the capabilities of current hardware and software. Otherwise we'd still be testing how well DOS runs on new processors. Right now everyone is (or has already) upgraded their PC benchmarking code such that you need a dual-core processor to do well on some of the tests. If you have a pgbench version with better concurrency features, I for one would love to see it. I'm in the middle of patching that thing up right now anyway but hadn't gotten that far (yet--I just spent some of yesterday staring at how it submits into libpq trying to figure out how to improve that). I would be happy to take your changes, my changes, changes to the base code since you forked it, and reconcile everything together into a pgbench2007--whose results can't be directly compared to the earlier version, but are more useful on current gen multi-processor/core systems. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, Dec 13, 2006 at 01:03:04PM -0500, Ron wrote: >What I find interesting is that so far Guido's C2D Mac laptop has >gotten the highest values by far in this set of experiments, and no >one else is even close. >The slowest results, Michael's, are on the system with what appears >to be the slowest CPU of the bunch; and the ranking of the rest of >the results seem to similarly depend on relative CPU >performance. This is not what one would naively expect when benching >a IO intensive app like a DBMS. Note that I ran with fsync off, that the data set is <300M, and that all of the systems (IIRC) have at least 1G RAM. This is exactly the distribution I would expect since we're configuring the benchmark to determine whether cpu-specific optimizations affect the results. Mike Stone
At 10:00 AM 12/14/2006, Greg Smith wrote: >On Wed, 13 Dec 2006, Ron wrote: > >>The slowest results, Michael's, are on the system with what appears >>to be the slowest CPU of the bunch; and the ranking of the rest of >>the results seem to similarly depend on relative CPU >>performance. This is not what one would naively expect when >>benching a IO intensive app like a DBMS. > >pgbench with 3000 total transactions and fsync off is barely doing >I/O to disk; it's writing a bunch of data to the filesystem cache >and ending the benchmark before the data even makes it to the hard >drive. This is why his results become completely different as soon >as the number of transactions increases. With little or no actual >disk writes, you should expect results to be ranked by CPU speed. I of course agree with you in the general sense. OTOH, I'm fairly sure the exact point where this cross-over occurs is dependent on the components and configuration of the system involved. (Nor do I want to dismiss this scenario as irrelevant or unimportant. There are plenty of RW situations where this takes place or where the primary goal of a tuning effort is to make it take place. Multi-GB BB RAID caches anyone?) In addition, let's keep in mind that we all know that overall system performance is limited by whatever component hits its limits first. Local pg performance has been known to be limited by any of: CPUs, memory subsystems, or physical IO subsystems. Intuitively, one tends to expect only the latter to be a limiting factor in the vast majority of DBMS tasks. pg has a history of regularly surprising such intuition in many cases. IMO, this makes good benchmarking tools and procedures more important to have. (If nothing else, knowing what component is likely to be the bottleneck in system "X" made of components "x1, x2, x3, ...." for task "Y" is valuable lore for the pg community to have as preexisting data when first asked any given question on this list! ) One plausible positive effect of tuning like that which Daniel advocates is to "move" the level of system activity where the physical IO subsystem becomes the limiting factor on overall system performance. We are not going to know definitively if such an effect exists, or to what degree, or how to achieve it, if we don't have appropriately rigorous and reproducible experiments (and documentation of them) in place to test for it. So it seems to make sense that the community should have a discussion about the proper benchmarking of pg and to get some results based on said discussion. Ron Peacetree
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> On Wed, 2006-12-13 at 18:36 -0800, Josh Berkus wrote: >>> Mostly, though, pgbench just gives the I/O system a workout. It's not a >>> really good general workload. > >> It also will not utilize all cpus on a many cpu machine. We recently >> found that the only way to *really* test with pgbench was to actually >> run 4+ copies of pgbench at the same time. > > The pgbench app itself becomes the bottleneck at high transaction > rates. Awhile back I rewrote it to improve its ability to issue > commands concurrently, but then desisted from submitting the > changes --- if we change the app like that, future numbers would > be incomparable to past ones, which sort of defeats the purpose of a > benchmark no? What is to stop us from running the new pgbench against older versions of PGSQL? Any stats taken from a run of pgbench a long time ago probably aren't relevant against a modern test anyway as the underlying hardware and OS are likely to have changed or been updated.
On Dec 14, 2006, at 16:00 , Greg Smith wrote: > On Wed, 13 Dec 2006, Ron wrote: > >> The slowest results, Michael's, are on the system with what >> appears to be the slowest CPU of the bunch; and the ranking of the >> rest of the results seem to similarly depend on relative CPU >> performance. This is not what one would naively expect when >> benching a IO intensive app like a DBMS. > > pgbench with 3000 total transactions and fsync off is barely doing > I/O to disk; it's writing a bunch of data to the filesystem cache > and ending the benchmark before the data even makes it to the hard > drive. This is why his results become completely different as soon > as the number of transactions increases. With little or no actual > disk writes, you should expect results to be ranked by CPU speed. I also second your suggestion that pgbench should be run with -S to disable updates. As far as I can see, nobody has reported numbers for this setting, so here goes. I also increased the buffer size, which I found was needed to avoid hitting the disk for block reads, and increased the memory settings. My PostgreSQL config overrides, then, are:

shared_buffers = 1024MB
work_mem = 1MB
maintenance_work_mem = 16MB
fsync = off

Environment: Linux 2.6.15-23-amd64-generic on Ubuntu. Dual-core AMD Opteron 280 with 4GB of RAM. LSI PCI-X Fusion-MPT SAS. Running with: pgbench -S -v -n -t 5000 -c 5. Results as a graph: http://purefiction.net/paste/pgbench.pdf

Stats for CFLAGS="-O0": 18440.181894 19207.882300 19894.432185 19635.625622 19876.858884 20032.597042 19683.597973 20370.166669 19989.157881 20207.343510 19993.745956 20081.353580 20356.416424 20047.810017 20319.834190 19417.807528 19906.788454 20536.039929 19491.308046 20002.144230

Stats for CFLAGS="-O3 -msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe": 23830.358351 26162.203569 25569.091264 26762.755665 26590.822550 26864.908197 26608.029665 26796.116921 26323.742015 26692.576261 26878.859132 26106.770425 26328.371664 26755.595130 25488.304946 26635.527959 26377.485023 24817.590708 26480.245737 26223.427801

Alexander.
Alexander, Good stuff. Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" as well? As it is, you've given a good lower and upper bound on your performance obtainable using compiler options, but you've given no data to show what effect arch specific compiler options have by themselves. Also, what HDs are you using? How many in what config? Thanks in Advance, Ron Peacetree At 02:14 PM 12/14/2006, Alexander Staubo wrote: >My PostgreSQL config overrides, then, are: > >shared_buffers = 1024MB >work_mem = 1MB >maintenance_work_mem = 16MB >fsync = off > >Environment: Linux 2.6.15-23-amd64-generic on Ubuntu. Dual-core AMD >Opteron 280 with 4GB of RAM. LSI PCI-X Fusion-MPT SAS. > >Running with: pgbench -S -v -n -t 5000 -c 5. > >Results as a graph: http://purefiction.net/paste/pgbench.pdf > >Stats for CFLAGS="-O0": 18440.181894 19207.882300 19894.432185 >19635.625622 19876.858884 20032.597042 19683.597973 20370.166669 >19989.157881 20207.343510 19993.745956 20081.353580 20356.416424 >20047.810017 20319.834190 19417.807528 19906.788454 20536.039929 >19491.308046 20002.144230 > >Stats for CFLAGS="-O3 -msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe": 23830.358351 26162.203569 25569.091264 >26762.755665 26590.822550 26864.908197 26608.029665 26796.116921 >26323.742015 26692.576261 26878.859132 26106.770425 26328.371664 >26755.595130 25488.304946 26635.527959 26377.485023 24817.590708 >26480.245737 26223.427801
On Dec 14, 2006, at 20:28 , Ron wrote: > Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" as well? All right. From my perspective, the effect of -O3 is significant, whereas architecture-related optimizations have no statistically significant effect. As far as I'm aware, though, there are no other arch targets on the Opteron that will make sense, there being no predecessor CPU instruction set to choose from; -march=pentium4 doesn't exist. > Also, what HDs are you using? How many in what config? I believe the volume is a two-drive RAID 1 configuration, but I'm not managing these servers, so I'll ask the company's support people. Interestingly enough I see that PostgreSQL seems to be writing around 1MB/s during the pgbench run, even though I'm running pgbench in the -S mode. I haven't had the chance to look at the source yet; is it really only doing selects? Alexander.
At 05:39 PM 12/14/2006, Alexander Staubo wrote: >On Dec 14, 2006, at 20:28 , Ron wrote: > >>Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" as well? > >All right. From my perspective, the effect of -O3 is significant, >whereas architecture-related optimizations have no statistically >significant effect. Is this opinion? Or have you rerun the tests using the flags I suggested? If so, can you post the results? If "-O3 -msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" results in a 30-40% speedup over "-O0", and "-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" results in a 5-10% speedup, then ~ 1/8 - 1/3 of the total possible speedup is due to arch specific optimizations. (Testing "-O3" in isolation in addition tests for independence of factors as well as showing what "plain" "-O3" can accomplish.) Some might argue that a 5-10% speedup which represents 1/8 - 1/3 of the total speedup is significant... But enough speculating. I look forward to seeing your data. Ron Peacetree
On Dec 15, 2006, at 01:16 , Ron wrote: > At 05:39 PM 12/14/2006, Alexander Staubo wrote: >> On Dec 14, 2006, at 20:28 , Ron wrote: >> >>> Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" as well? >> >> All right. From my perspective, the effect of -O3 is significant, >> whereas architecture-related optimizations have no statistically >> significant effect. > > Is this opinion? Or have you rerun the tests using the flags I > suggested? If so, can you post the results? Sorry, I neglected to include the pertinent graph: http://purefiction.net/paste/pgbench2.pdf The raw data:

CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe": 18480.899621 19977.162108 19640.562003 19823.585944 19500.293284 19964.383540 20228.664827 20515.766366 19956.431120 19740.795459 20184.551390 19984.907398 20457.260691 19771.395220 20159.225628 19907.248149 20197.580815 19947.498185 20209.450748 20088.501904

CFLAGS="-O3": 23814.672315 26846.761905 27137.807960 26957.898233 27109.057570 26997.227925 27291.056939 27565.553643 27422.624323 27392.397185 27757.144967 27402.365372 27563.365421 27349.544685 27544.658154 26957.200592 27523.824623 27457.380654 27052.910082 24452.819263

CFLAGS="-O0": 18440.181894 19207.882300 19894.432185 19635.625622 19876.858884 20032.597042 19683.597973 20370.166669 19989.157881 20207.343510 19993.745956 20081.353580 20356.416424 20047.810017 20319.834190 19417.807528 19906.788454 20536.039929 19491.308046 20002.144230

CFLAGS="-O3 -msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe": 23830.358351 26162.203569 25569.091264 26762.755665 26590.822550 26864.908197 26608.029665 26796.116921 26323.742015 26692.576261 26878.859132 26106.770425 26328.371664 26755.595130 25488.304946 26635.527959 26377.485023 24817.590708 26480.245737 26223.427801

> If "-O3 -msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" results in a 30-40% speedup over "-O0", and > "-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" > results in a 5-10% speedup, then ~ 1/8 - 1/3 of the total possible > speedup is due to arch specific optimizations. Unfortunately, I don't see a 5-10% speedup; "-O0" and "-msse2 ..." are statistically identical. Alexander.
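For anyone who wants to check the "statistically identical" claim against the raw runs above, a quick mean/standard-deviation pass is enough. A sketch using awk, assuming the twenty tps values for one build have been saved to a file (here hypothetically named tps-O0.txt), one value per line:

# mean and sample standard deviation of one tps value per line
awk '{ s += $1; ss += $1 * $1; n++ }
     END { m = s / n;
           printf "n=%d mean=%.1f stddev=%.1f\n", n, m, sqrt((ss - n*m*m) / (n - 1)) }' tps-O0.txt

Running it once per flag set makes the overlap between the "-O0" and arch-flags-only runs, and the gap up to the two "-O3" runs, easy to see at a glance.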
At 07:27 PM 12/14/2006, Alexander Staubo wrote: >Sorry, I neglected to include the pertinent graph: > > http://purefiction.net/paste/pgbench2.pdf In fact, your graph suggests that using arch specific options in addition to -O3 actually =hurts= performance. ...that seems unexpected... Ron Peacetree
On Thu, 14 Dec 2006, Alexander Staubo wrote: > Interestingly enough I see that PostgreSQL seems to be writing around 1MB/s > during the pgbench run, even though I'm running pgbench in the -S mode. I > haven't had the chance to look at the source yet; is it really only doing > selects? I've noticed the same thing and have been meaning to figure out what the cause is. It's just doing a select in there; it's not even in a begin/end block. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Dec 15, 2006, at 04:09 , Ron wrote: > At 07:27 PM 12/14/2006, Alexander Staubo wrote: > >> Sorry, I neglected to include the pertinent graph: >> >> http://purefiction.net/paste/pgbench2.pdf > In fact, your graph suggests that using arch specific options in > addition to -O3 actually =hurts= performance. The difference is very slight. I'm going to run without -funroll-loops and -pipe (which are not arch-related) to get better data. Alexander.
On Fri, Dec 15, 2006 at 10:53:25AM +0100, Alexander Staubo wrote: > The difference is very slight. I'm going to run without -funroll-loops > and -pipe (which are not arch-related) to get better data. -pipe does not matter for the generated code; it only affects compiler speed. (It simply means that the compiler runs cpp | cc | as1 instead of cpp > tmp; cc < tmp > tmp2; as1 < tmp2.) /* Steinar */ -- Homepage: http://www.sesse.net/
At 04:54 AM 12/15/2006, Alexander Staubo wrote: >On Dec 15, 2006, at 04:09 , Ron wrote: > >>At 07:27 PM 12/14/2006, Alexander Staubo wrote: >> >>>Sorry, I neglected to include the pertinent graph: >>> >>> http://purefiction.net/paste/pgbench2.pdf >>In fact, your graph suggests that using arch specific options in >>addition to -O3 actually =hurts= performance. > >According to the tech staff, this is a Sun X4100 with a two-drive >RAID 1 volume. No idea about the make of the hard drives. > >Alexander. http://www.sun.com/servers/entry/x4100/features.xml So we are dealing with a 1U 1-4S (which means 1-8C) AMD Kx box with up to 32GB of ECC RAM (DDR2 ?) and 2 Seagate 2.5" SAS HDs. http://www.seagate.com/cda/products/discsales/index/1,,,00.html?interface=SAS My bet is the X4100 contains one of the 3 models of Cheetah 15K.4's. A simple du, dkinfo, whatever, will tell you which. I'm looking more closely into exactly what the various gcc -O optimizations do on Kx's as well. 64b vs 32b gets x86 compatible code access to ~ 2x as many registers; and MMX or SSE instructions get you access to not only more registers, but wider ones as well. As one wit has noted, "all optimization is an exercise in caching." (Terje Mathisen- one of the better assembler coders on the planet.) It seems unusual that code generation options which give access to more registers would ever result in slower code... Ron Peacetree
On 12/15/06, Ron <rjpeace@earthlink.net> wrote: > I'm looking more closely into exactly what the various gcc -O > optimizations do on Kx's as well. > 64b vs 32b gets x86 compatible code access to ~ 2x as many registers; > and MMX or SSE instructions get you access to not only more > registers, but wider ones as well. > > As one wit has noted, "all optimization is an exercise in caching." > (Terje Mathisen- one of the better assembler coders on the planet.) > > It seems unusual that code generation options which give access to > more registers would ever result in slower code... The slower is probably due to the unroll loops switch which can actually hurt code due to the larger footprint (less cache coherency). The extra registers are not all that important because of pipelining and other hardware tricks. Pretty much all the old assembly strategies such as forcing local variables to registers are basically obsolete...especially with regards to integer math. As I said before, modern CPUs are essentially RISC engines with a CISC preprocessing engine laid on top. Things are much more complicated than they were in the old days where you could count instructions for the assembly optimization process. I suspect that there is little or no difference between the -march=i686 and the various specific architectures. Did anybody think to look at the binaries and look for the amount of differences? I bet code compiled for -march=opteron will run just fine on a Pentium 2 if compiled for 32 bit. merlin
On Fri, 15 Dec 2006, Merlin Moncure wrote: > The slower is probably due to the unroll loops switch which can > actually hurt code due to the larger footprint (less cache coherency). The cache issues are so important with current processors that I'd suggest throwing -Os (optimize for size) into the mix people test. That one may stack usefully with -O2, but probably not with -O3 (3 includes optimizations that increase code size). -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
At 09:23 AM 12/15/2006, Merlin Moncure wrote: >On 12/15/06, Ron <rjpeace@earthlink.net> wrote: >> >>It seems unusual that code generation options which give access to >>more registers would ever result in slower code... > >The slower is probably due to the unroll loops switch which can >actually hurt code due to the larger footprint (less cache coherency). I have seen that effect as well occasionally in the last few decades ;-) OTOH, suspicion is not _proof_; and I've seen other "optimizations" turn out to be "pessimizations" over the years as well. >The extra registers are not all that important because of pipelining >and other hardware tricks. No. Whoever told you this or gave you such an impression was mistaken. There are many instances of x86 compatible code that get 30-40% speedups just because they get access to 16 rather than 8 GPRs when recompiled for x86-64. >Pretty much all the old assembly strategies such as forcing local >variables to registers are basically obsolete...especially with >regards to integer math. Again, not true. OTOH, humans are unlikely at this point to be able to duplicate the accuracy of the compiler's register coloring algorithms. Especially on x86 compatibles. (The flip side is that _expert_ humans can often put the quirky register set and instruction pipelines of x86 compatibles to more effective use for a specific chunk of code than even the best compilers can.) >As I said before, modern CPUs are essentially RISC engines with a >CISC preprocessing engine laid in top. I'm sure you meant modern =x86 compatible= CPUs are essentially RISC engines with a CISC engine on top. Just as "all the world's not a VAX", "all CPUs are not x86 compatibles". Forgetting this has occasionally cost folks I know... >Things are much more complicated than they were in the old days >where you could count instructions for the assembly optimization process. Those were the =very= old days in computer time... >I suspect that there is little or no differnece between the >-march=686 and the various specifc archicectures. There should be. The FSF compiler folks (and the rest of the industry compiler folks for that matter) are far from stupid. They are not just adding compiler switches because they randomly feel like it. Evidence suggests that the most recent CPUs are in need of =more= arch specific TLC compared to their ancestors, and that this trend is not only going to continue, it is going to accelerate. >Did anybody think to look at the binaries and look for the amount of >differences? I bet you code compiled for march=opteron will just >fine on a pentium 2 if compiled >for 32 bit. Sucker bet given that the whole point of a 32b x86 compatible is to be able to run code on any IA32 ISA CPU. OTOH, I bet that code optimized for best performance on a P2 is not getting best performance on a P4. Or vice versa. ;-) The big arch specific differences in Kx's are in 64b mode. Not 32b. Ron Peacetree
On 12/15/06, Ron <rjpeace@earthlink.net> wrote: > At 09:23 AM 12/15/2006, Merlin Moncure wrote: > >On 12/15/06, Ron <rjpeace@earthlink.net> wrote: > >> > >>It seems unusual that code generation options which give access to > >>more registers would ever result in slower code... > > > >The slower is probably due to the unroll loops switch which can > >actually hurt code due to the larger footprint (less cache coherency). > > I have seen that effect as well occasionally in the last few decades > ;-) OTOH, suspicion is not _proof_; and I've seen other > "optimizations" turn out to be "pessimizations" over the years as well. > > >The extra registers are not all that important because of pipelining > >and other hardware tricks. > > No. Whoever told you this or gave you such an impression was > mistaken. There are many instances of x86 compatible code that get > 30-40% speedups just because they get access to 16 rather than 8 GPRs > when recompiled for x84-64. I'm not debating that this is true in specific cases. Encryption and video en/decoding have shown to be faster in 64 bit mode on the same architecture (a cursory search in Google will confirm this). However, 32-64 bit is not the same argument since there are a lot of other variables besides more registers. 64 bit mode is often slower on many programs because of the extra code size from 64 bit pointers. We benchmarked PostgreSQL internally here and found it to be fastest in 32 bit mode running on a 64 bit platform -- this was on a quad opteron 870 running our specific software stack, your results might be different of course. > >Pretty much all the old assembly strategies such as forcing local > >variables to registers are basically obsolete...especially with > >regards to integer math. > > Again, not true. OTOH, humans are unlikely at this point to be able > to duplicate the accuracy of the compiler's register coloring > algorithms. Especially on x86 compatibles. (The flip side is that > _expert_ humans can often put the quirky register set and instruction > pipelines of x86 compatibles to more effective use for a specific > chunk of code than even the best compilers can.) > > > >As I said before, modern CPUs are essentially RISC engines with a > >CISC preprocessing engine laid in top. > > I'm sure you meant modern =x86 compatible= CPUs are essentially RISC > engines with a CISC engine on top. Just as "all the world's not a > VAX", "all CPUs are not x86 compatibles". Forgetting this has > occasionally cost folks I know... Yes, in fact I made this point earlier. > >Things are much more complicated than they were in the old days > >where you could count instructions for the assembly optimization process. > > Those were the =very= old days in computer time... > > > >I suspect that there is little or no differnece between the > >-march=686 and the various specifc archicectures. > > There should be. The FSF compiler folks (and the rest of the > industry compiler folks for that matter) are far from stupid. They > are not just adding compiler switches because they randomly feel like it. > > Evidence suggests that the most recent CPUs are in need of =more= > arch specific TLC compared to their ancestors, and that this trend is > not only going to continue, it is going to accelerate. > > > >Did anybody think to look at the binaries and look for the amount of > >differences? I bet you code compiled for march=opteron will just > >fine on a pentium 2 if compiled > >for 32 bit. 
> Sucker bet given that the whole point of a 32b x86 compatible is to > be able to run code on any I32 ISA. CPU. > OTOH, I bet that code optimized for best performance on a P2 is not > getting best performance on a P4. Or vice versa. ;-) > > The big arch specific differences in Kx's are in 64b mode. Not 32b I don't think so. IMO all the processor specific instruction sets were hacks of 32 bit mode to optimize specific tasks. Except for certain things these instructions are rarely, if ever, used in 64 bit mode, especially in integer math (i.e. database binaries). Since Intel and AMD64 64 bit are virtually identical I submit that -march is not really important anymore except for very, very specific (but important) cases like spinlocks. This thread is about how much architecture dependent binaries can beat standard ones. I say they don't very much at all, and with the specific exception of Daniel's benchmarking the results posted to this list bear that out. merlin
At 09:50 AM 12/15/2006, Greg Smith wrote: >On Fri, 15 Dec 2006, Merlin Moncure wrote: > >>The slower is probably due to the unroll loops switch which can >>actually hurt code due to the larger footprint (less cache coherency). > >The cache issues are so important with current processors that I'd >suggest throwing -Os (optimize for size) into the mix people >test. That one may stack usefully with -O2, but probably not with >-O3 (3 includes optimizations that increase code size). -Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. -Os disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version Hmmm. That list of disabled flags bears thought. -falign-functions -falign-jumps -falign-loops -falign-labels 1= Most RISC CPUs performance is very sensitive to misalignment issues. Not recommended to turn these off. -freorder-blocks Reorder basic blocks in the compiled function in order to reduce number of taken branches and improve code locality. Enabled at levels -O2, -O3. -freorder-blocks-and-partition In addition to reordering basic blocks in the compiled function, in order to reduce number of taken branches, partitions hot and cold basic blocks into separate sections of the assembly and .o files, to improve paging and cache locality performance. This optimization is automatically turned off in the presence of exception handling, for link once sections, for functions with a user-defined section attribute and on any architecture that does not support named sections. 2= Most RISC CPUs are cranky about branchy code and (lack of) cache locality. Wouldn't suggest punting these either. -fprefetch-loop-arrays If supported by the target machine, generate instructions to prefetch memory to improve the performance of loops that access large arrays. This option may generate better or worse code; results are highly dependent on the structure of loops within the source code. 3= OTOH, This one looks worth experimenting with turning off. -ftree-vect-loop-version Perform loop versioning when doing loop vectorization on trees. When a loop appears to be vectorizable except that data alignment or data dependence cannot be determined at compile time then vectorized and non-vectorized versions of the loop are generated along with runtime checks for alignment or dependence to control which version is executed. This option is enabled by default except at level -Os where it is disabled. 4= ...and this one looks like a 50/50 shot. Ron Peacetree
At 10:55 AM 12/15/2006, Merlin Moncure wrote: >On 12/15/06, Ron <rjpeace@earthlink.net> wrote: >> >>There are many instances of x86 compatible code that get >>30-40% speedups just because they get access to 16 rather than 8 GPRs >>when recompiled for x84-64. > >...We benchmarked PostgreSQL internally here and found it to be >fastest in 32 bit mode running on a 64 bit platform -- this was on a >quad opteron 870 runnning our specific software stack, your results >might be differnt of course. On AMD Kx's, you probably will get best performance in 64b mode (so you get all those extra registers and other stuff) while using 32b pointers (to keep code size and memory footprint down). On Intel C2's, things are more complicated since Intel's x86-64 implementation and memory IO architecture are both different enough from AMD's to have caused some consternation on occasion when Intel's 64b performance did not match AMD's. >>The big arch specific differences in Kx's are in 64b mode. Not 32b > >I dont think so. IMO all the processor specific instruction sets were >hacks of 32 bit mode to optimize specific tasks. Except for certain >things these instructions are rarely, if ever used in 64 bit mode, >especially in integer math (i.e. database binaries). Since Intel and >AMD64 64 bit are virtually indentical I submit that -march is not >really important anymore except for very, very specific (but >important) cases like spinlocks. Take a good look at the processor specific manuals and the x86-64 benches around the net. The evidence outside the DBMS domain is pretty solidly in contrast to your statement and position. Admittedly, DBMS are not web servers or FPS games or ... That's why we need to do our own rigorous study of the subject. >This thread is about how much architecture depenant binares can beat >standard ones. I say they don't very much at all, and with the >specific exception of Daniel's >benchmarking the results posted to this list bear that out. ...and IMHO the issue is still very much undecided given that we don't have enough properly obtained and documented evidence. ATM, the most we can say is that in a number of systems with modest physical IO subsystems that are not running Gentoo Linux we have not been able to duplicate the results. (We've also gotten some interesting results like yours suggesting the arch specific optimizations are bad for pg performance in your environment.) In the process, questions have been asked and issues raised regarding both the tools involved and the proper way to use them. We really do need to have someone other than Daniel duplicate his Gentoo environment and independently try to duplicate his results. ...and let us bear in mind that this is not just intellectual curiosity. The less mysterious pg is, the better the odds pg will be adopted in any specific case. Ron Peacetree
On Dec 15, 2006, at 17:53 , Ron wrote: > At 09:50 AM 12/15/2006, Greg Smith wrote: >> On Fri, 15 Dec 2006, Merlin Moncure wrote: >> >>> The slower is probably due to the unroll loops switch which can >>> actually hurt code due to the larger footprint (less cache >>> coherency). >> >> The cache issues are so important with current processors that I'd >> suggest throwing -Os (optimize for size) into the mix people >> test. That one may stack usefully with -O2, but probably not with >> -O3 (3 includes optimizations that increase code size). > > -Os > Optimize for size. -Os enables all -O2 optimizations that do not > typically increase code size. It also performs further > optimizations designed to reduce code size. So far I have been compiling PostgreSQL and running my pgbench script manually, but this makes me want to modify my script to run pgbench automatically using all possible permutations of a set of compiler flags. Last I tried GCC to produce 32-bit code on this Opteron system, though, it complained about the lack of a compiler; can I persuade it to generate 32-bit code (or 32-bit pointers for that matter) without going the cross-compilation route? Alexander.
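A rough sketch of that kind of driver loop, for whoever wants to adapt it (the source path, install prefix, database name, and flag list are only placeholders, and re-initializing and restarting the test cluster between builds is left out):

#!/bin/sh
# rebuild PostgreSQL with each flag set and run the same select-only test
SRC=$HOME/src/postgresql-8.2.0
for flags in "-O0" "-O2" "-Os" "-O3" "-O3 -march=opteron"; do
    cd "$SRC" || exit 1
    make distclean > /dev/null 2>&1
    ./configure --prefix=/usr/local/pgtest CFLAGS="$flags" > /dev/null &&
    make > /dev/null && make install > /dev/null
    # ...re-initdb and restart the server from /usr/local/pgtest here...
    name=`echo "$flags" | tr ' ' '_'`
    /usr/local/pgtest/bin/pgbench -S -v -n -t 5000 -c 5 bench > "result.$name"
done

As for the 32-bit question: gcc's -m32 switch will generate 32-bit code on an Opteron without a cross-compiler, but linking only works if the 32-bit C library and startup files (the multilib packages on most distributions) are installed.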
On Fri, Dec 15, 2006 at 12:24:46PM -0500, Ron wrote: >ATM, the most we can say is that in a number of systems with modest >physical IO subsystems So I reran it on a 3.2GHz Xeon with 6G RAM off a ramdisk; I/O ain't the bottleneck on that one. Results didn't show any significant gains regardless of compilation options (results hovered around 12k tps). If people want to continue this, I will point out that they should make sure they're linked against the optimized libpq rather than an existing one elsewhere in the library path. Beyond that, I'm done with this thread. Maybe there are some gains to be found somewhere, but the testing done thus far (while limited) is sufficient, IMO, to demonstrate that compiler options aren't going to provide a blow-your-socks-off dramatic performance improvement. Mike Stone
Alexander Staubo wrote: > On Dec 15, 2006, at 17:53 , Ron wrote: > >> At 09:50 AM 12/15/2006, Greg Smith wrote: >> >>> On Fri, 15 Dec 2006, Merlin Moncure wrote: >>> >>>> The slower is probably due to the unroll loops switch which can >>>> actually hurt code due to the larger footprint (less cache coherency). >>> >>> The cache issues are so important with current processors that I'd >>> suggest throwing -Os (optimize for size) into the mix people test. > > So far I have been compiling PostgreSQL and running my pgbench script > manually, but this makes me want to modify my script to run pgbench > automatically using all possible permutations of a set of compiler flags. I don't know if it's practical, but this link comes to mind: http://clusty.com/search?query=acovea -- Cosimo
At 07:06 PM 12/15/2006, Michael Stone wrote: >On Fri, Dec 15, 2006 at 12:24:46PM -0500, Ron wrote: >>ATM, the most we can say is that in a number of systems with modest >>physical IO subsystems > > >So I reran it on a 3.2GHz xeon with 6G RAM off a ramdisk; I/O ain't >the bottleneck on that one. Results didn't show didn't show any >signficant gains regardless of compilation options (results hovered >around 12k tps). If people want to continue this, I will point out >that they should make sure they're linked against the optimized >libpq rather than an existing one elsewhere in the library path. >Beyond that, I'm done with this thread. Maybe there are some gains >to be found somewhere, but the testing done thus far (while limited) >is sufficient, IMO, to demonstrate that compiler options aren't >going to provide a blow-your-socks-off dramatic performance improvement. AFAICT, no one has stated there would be a "blow-your-socks-off dramatic performance improvement" for pg due to compilation options. Just that there might be some, and there might be some that are arch specific. So far these experiments have shown:
= multiple instances of a ~30-35% performance improvement going from -O0 to -O3
= 1 instance of arch specific options hurting performance when combined with -O3
= 1 instance of arch specific options helping performance on an OS that only one person has tested (Gentoo Linux)
= that a 2.33 GHz C2D Mac laptop (under what OS?) with a typical modest laptop physical IO subsystem can do ~2100 tps
= that pg has a "speed limit" on a 3.2GHz Xeon (which kind?) with 6G RAM off a ramdisk (under what OS?) of ~12K tps (I'd be curious to see what this limit is with better CPUs and memory subsystems)
Note that except for the first point, all the other results are singletons that as of yet have not been reproduced. The most important "gain" IMO is Knowledge, and I'd say there is still more to learn and/or verify IMHO. YMMV. Ron Peacetree
On Sat, Dec 16, 2006 at 10:53:21AM -0500, Ron wrote: > AFAICT, no one has stated there would be a "blow-your-socks-off > dramatic performance improvement" for pg due to compilation > options. Just that there might be some, and there might be some that > are arch specific. FWIW, the original claim was: "It's really important to have your GLIBC compiled for your processor. It is essencial for performance." /* Steinar */ -- Homepage: http://www.sesse.net/
On Sat, Dec 16, 2006 at 10:53:21AM -0500, Ron wrote: >The most important "gain" IMO is Knowledge, and I'd say there is >still more to learn and/or verify IMHO. YMMV. Well, I think there are other areas where I can spend my time where potential gains are more likely. YMMV (although, I note, you don't seem to be spending much of your own time testing this) Mike Stone
Sorry for the delay in responding. I had familial obligations. As a matter of fact, I am spending a decent amount of time on this. I don't usually pore through documentation for compilers and OS's to the degree I've been since this thread started. Nor do I usually try and get access to the HW I'm presently tracking down. I'll post my thoughts re: detailed analysis of gcc/g++ compiler options later today or tomorrow as work schedule allows. Why this is worth it: 1= Any gains from setup and configuration are the cheapest ones available once we codify how to obtain them. 2= any public database or knowledge about how to best setup, configure, and test pg is very good for the community. 3= developers need to know and agree on proper procedure and technique for generating results for discussion or we end up wasting a lot of time. 4= measuring and documenting pg performance means we know where best to allocate resources for improving pg. Or where using pg is (in)appropriate compared to competitors. Potential performance gains are not the only value of this thread. Ron Peacetree At 12:33 PM 12/16/2006, Michael Stone wrote: >On Sat, Dec 16, 2006 at 10:53:21AM -0500, Ron wrote: >>The most important "gain" IMO is Knowledge, and I'd say there is >>still more to learn and/or verify IMHO. YMMV. > >Well, I think there are other areas where I can spend my time where >potential gains are more likely. YMMV (although, I note, you don't >seem to be spending much of your own time testing this)