Thread: Need help with 8.4 Performance Testing
Database performance geeks,

We have a number of patches pending for 8.4 designed to improve database performance in a variety of circumstances. We need as many users as possible to build test versions of PostgreSQL with these patches, test how well they perform, and report back in some detail.

Users with unusual hardware architectures (16 or more cores, ARM, Power, SSD, NFS-mounted data) or operating systems (Solaris, OSX, Windows-64) are particularly helpful. Testers need to be familiar with building PostgreSQL from source and patching it, as well as basic PostgreSQL tuning (except for the Wizard patch), and should have some kind of performance test available, ideally something based on your own application use.

If you are going to use pgbench to test, *please* read Greg Smith's notes first:
http://www.westnet.com/~gsmith/gregsmith/content/postgresql/pgbench-scaling.htm

The Wiki (http://wiki.postgresql.org/wiki/CommitFest_2008-11) has a full list of patches, but below are the ones we could particularly use help with. You *do* need to read the entire mail threads linked below to understand the patches. Thanks for your help!

Proposal of PITR performance improvement (Koichi Suzuki):
http://archives.postgresql.org/message-id/a778a7260811270404g49254640x8ed58b12b7c65d0b@mail.gmail.com
http://archives.postgresql.org/message-id/a778a7260810280033n43f70d36x8c437eacf9a5461e@mail.gmail.com

Simple postgresql.conf wizard:
http://archives.postgresql.org/message-id/Pine.GSO.4.64.0811012101220.17619@westnet.com
http://archives.postgresql.org/message-id/Pine.GSO.4.64.0811291403040.12885@westnet.com

Improve Performance of Multi-Batch Hash Join for Skewed Data Sets:
http://archives.postgresql.org/message-id/6EEA43D22289484890D119821101B1DF2C1683@exchange20.mercury.ad.ubc.ca
http://archives.postgresql.org/message-id/1924d1180811051606w19aaf30du589e8ea10ea5534d@mail.gmail.com
http://archives.postgresql.org/message-id/22901.1227228246@sss.pgh.pa.us

Window Functions:
http://archives.postgresql.org/message-id/e08cc0400810270912u49a6ec83vc23984c01f368f76@mail.gmail.com
http://archives.postgresql.org/message-id/492D1246.5070101@enterprisedb.com
http://archives.postgresql.org/message-id/839FB90FF49D4120B7107ED0D7B3E5B6@amd64

Parallel restore (we especially need tests on 16+ cores):
http://archives.postgresql.org/message-id/490878AC.1@dunslane.net

B-Tree emulation for GIN:
http://archives.postgresql.org/message-id/491B1888.9020903@sigaev.ru
http://archives.postgresql.org/message-id/Pine.LNX.4.64.0811191828050.7862@sn.sai.msu.ru

Also, the following patches currently still have bugs, but when the bugs are fixed I'll be looking for performance testers, so please either watch the wiki or watch this space:

-- Block-level CRC checks (Alvaro Herrera)
-- Auto Partitioning Patch (Nikhil Sontakke)
-- posix_fadvise (Gregory Stark)
-- Hash Join-Filter Pruning using Bloom Filters
-- On-disk bitmap indexes

Please report your results, with the patch name in the subject line, on this mailing list or on -hackers. Thank you, and your help will get a better 8.4 out sooner.

--
Josh Berkus
PostgreSQL
San Francisco
Josh,

Since a number of these performance patches use our hash function, would it make sense to also apply the last patch that upgrades the hash function from the single mix() to the two-function mix()/final()? The additional changes increase the performance of the hash function by another 50% or so.

My two cents.

Regards,
Ken

On Sun, Dec 07, 2008 at 11:38:01AM -0800, Josh Berkus wrote:
> Database performance geeks,
>
> We have a number of patches pending for 8.4 designed to improve database
> performance in a variety of circumstances. We need as many users as possible
> to build test versions of PostgreSQL with these patches, and test how well
> they perform, and report back in some detail.
>
> [full patch list snipped]
On Sun, Dec 7, 2008 at 12:38 PM, Josh Berkus <josh@agliodbs.com> wrote:

I've got a pair of 8 core opteron 16 drive machines I would like to test it on. If nothing else I'll just take queries from the log to run against an 8.4 install. It'll have to be late at night though...

> If you are going to use pgbench to test, *please* read Greg Smith's notes
> first:
> http://www.westnet.com/~gsmith/gregsmith/content/postgresql/pgbench-scaling.htm

When I last used pgbench I wanted to test it with an extremely large dataset, but it maxes out at -s 4xxx or so, and that's only in the 40 gigabyte range. Is the limit raised for the pgbench included in contrib in 8.4? I'm guessing it's an arbitrary limit.
I'll be glad to test the patches using pgbench on my POWER4 box running AIX 5.3 and an IA64 that runs HP-UX 11.31.

Derek

On Dec 7, 2008, at 2:38 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Database performance geeks,
>
> We have a number of patches pending for 8.4 designed to improve
> database performance in a variety of circumstances.
>
> [full patch list snipped]
On Sun, 7 Dec 2008, Scott Marlowe wrote:

> When I last used pgbench I wanted to test it with an extremely large
> dataset, but it maxes out at -s 4xxx or so, and that's only in the
> 40Gigabyte range. Is the limit raised for the pgbench included in
> contrib in 8.4? I'm guessing it's an arbitrary limit.

There's no artificial limit, just ones that result from things like integer overflow. I don't think this has been an issue so far because pgbench becomes seek limited and stops producing interesting results once the database exceeds the sum of all available caching, which means you'd need more than 32GB of RAM in the system running pgbench before this is an issue. That happens to be the largest size system I've ever run it on...

I'd expect this statement around line 1060 of pgbench.c to overflow first:

    for (i = 0; i < naccounts * scale; i++)

where i is an integer, naccounts = 100,000, and scale is what you input. That will overflow a signed 32-bit integer at a scale of 2147. If you had tests that failed at twice that, I wonder if you were actually executing against a weird data set for scales of (2148..4294).

It's not completely trivial to fix (I just tried); the whole loop needs to be restructured to run against scale and 100,000 separately while keeping the current progress report intact. I'll take a shot at fixing it correctly, since this is a bug that should be corrected before 8.4 goes out. I guarantee that version will be used on systems with 64GB of RAM where this matters.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
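For illustration only, a minimal sketch of the restructuring Greg describes -- using a 64-bit counter so the naccounts * scale product can no longer wrap a signed 32-bit int. This is not the actual pgbench fix; the variable names and the 10,000-row progress interval are assumptions based on the description above.

    /* sketch only -- not the committed pgbench change */
    #include <stdint.h>
    #include <stdio.h>

    #define naccounts 100000        /* rows per unit of scale, per the mail above */

    static void
    init_accounts(int scale)
    {
        /* do the multiplication in 64 bits so it cannot overflow an int */
        int64_t total = (int64_t) naccounts * scale;
        int64_t i;

        for (i = 0; i < total; i++)
        {
            /* ... issue the INSERT INTO accounts here (elided) ... */

            if ((i + 1) % 10000 == 0)
                fprintf(stderr, "%lld tuples done.\n", (long long) (i + 1));
        }
    }

    int
    main(void)
    {
        init_accounts(3);           /* small scale just to exercise the loop */
        return 0;
    }

The same idea could equally be written as nested loops over scale and 100,000, which is closer to what Greg describes; the 64-bit counter is just the shortest way to show it.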
On Mon, Dec 8, 2008 at 1:15 PM, Greg Smith <gsmith@gregsmith.com> wrote: > On Sun, 7 Dec 2008, Scott Marlowe wrote: > >> When I last used pgbench I wanted to test it with an extremely large >> dataset, but it maxes out at -s 4xxx or so, and that's only in the >> 40Gigabyte range. Is the limit raised for the pgbench included in >> contrib in 8.4? I'm guessing it's an arbitrary limit. > > There's no artificial limit, just ones that result from things like integer > overflow. I don't think has been an issue so far because pgbench becomes > seek limited and stops producing interesting results once the database > exceeds the sum of all available caching, which means you'd need more than > 32GB of RAM in the system running pgbench before this is an issue. Which > happens to be the largest size system I've ever ran it on... Well, I have 32 Gig of ram and wanted to test it against a database that was at least twice as big as memory. I'm not sure why you'd consider the results uninteresting though, I'd think knowing how the db will perform with a very large transactional store that is twice or more the size of memory would be when it starts getting interesting.
On Mon, 8 Dec 2008, Scott Marlowe wrote: > Well, I have 32 Gig of ram and wanted to test it against a database > that was at least twice as big as memory. I'm not sure why you'd > consider the results uninteresting though, I'd think knowing how the > db will perform with a very large transactional store that is twice or > more the size of memory would be when it starts getting interesting. If you refer back to the picture associated with the link Josh suggested: http://www.westnet.com/~gsmith/gregsmith/content/postgresql/scaling.png You'll see that pgbench results hit a big drop once you clear the amount of memory being used to cache the accounts table. This curve isn't unique to what I did; I've seen the same basic shape traced out by multiple other testers on different hardware, independent of me. It just expands to the right based on the amount of RAM available. All I was trying to suggest was that even if you've got 32GB of RAM, you may already be into the more flat right section of that curve even with a 40GB database. That was a little system with 1GB of RAM+256MB of disk cache, and it was already toast at 750MB of database. Once you've gotten a database big enough to reach that point, results degrade to something related to database seeks/second rather than anything else, and further increases don't give you that much more info. This is why I'm not sure if the current limit really matters with 32GB of RAM, but it sure will be important if you want any sort of useful pgbench results at 64GB. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mon, Dec 8, 2008 at 5:52 PM, Greg Smith <gsmith@gregsmith.com> wrote: > On Mon, 8 Dec 2008, Scott Marlowe wrote: > >> Well, I have 32 Gig of ram and wanted to test it against a database >> that was at least twice as big as memory. I'm not sure why you'd >> consider the results uninteresting though, I'd think knowing how the >> db will perform with a very large transactional store that is twice or >> more the size of memory would be when it starts getting interesting. > > If you refer back to the picture associated with the link Josh suggested: > > http://www.westnet.com/~gsmith/gregsmith/content/postgresql/scaling.png > > You'll see that pgbench results hit a big drop once you clear the amount of > memory being used to cache the accounts table. This curve isn't unique to > what I did; I've seen the same basic shape traced out by multiple other > testers on different hardware, independent of me. It just expands to the > right based on the amount of RAM available. > > All I was trying to suggest was that even if you've got 32GB of RAM, you may > already be into the more flat right section of that curve even with a 40GB > database. That was a little system with 1GB of RAM+256MB of disk cache, and > it was already toast at 750MB of database. Once you've gotten a database > big enough to reach that point, results degrade to something related to > database seeks/second rather than anything else, and further increases don't > give you that much more info. This is why I'm not sure if the current limit > really matters with 32GB of RAM, but it sure will be important if you want > any sort of useful pgbench results at 64GB. I wonder if shared_buffers has any effect on how far you can go before you hit the 'tipping point'. merlin
On Mon, 8 Dec 2008, Merlin Moncure wrote:

> I wonder if shared_buffers has any effect on how far you can go before
> you hit the 'tipping point'.

If your operating system has any reasonable caching itself, not so much at first. As long as the index on the accounts table fits in shared_buffers, even the basic sort of caching logic an OS uses is perfectly functional for swapping the individual pages of the accounts table in and out, which is the main limiting factor on pgbench performance.

There is a further-out tipping point I've theorized about but not really explored: the point where even the database indexes stop fitting in memory usefully. As you get closer to that, I'd expect that the clock sweep algorithm used by shared_buffers should make it a bit more likely that those important blocks would hang around usefully if you put them there, rather than giving most of the memory to the OS to manage.

Since the data is about 7.5X as large as the indexes, that point is way further out than the basic bottlenecks. And if you graph pgbench results on a scale that usefully shows the results for in-memory TPS scores, you can barely see that part of the chart as well. One day I may get to mapping that out better, and if I do it will be interesting to see if the balance of shared_buffers to OS cache works the way I expect. I was waiting until I finished the pgtune program for that, since that's building some of the guts I wanted to make it easier to tweak postgresql.conf settings programmatically in between pgbench runs.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes: > On Mon, 8 Dec 2008, Merlin Moncure wrote: > >> I wonder if shared_buffers has any effect on how far you can go before >> you hit the 'tipping point'. > > If your operating system has any reasonable caching itself, not so much at > first. As long as the index on the account table fits in shared_buffers, even > the basic sort of caching logic an OS uses is perfectly functional for swapping > the individual pages of the account table in and out, the main limiting factor > on pgbench performance. I would expect higher shared_buffers to raise the curve before the first breakpoint but after the first breakpoint make the drop steeper and deeper. The equilibrium where the curve becomes flatter should be lower. That is, as long as the database fits entirely in RAM having more of the buffers be immediately in shared buffers is better. Once there's contention for the precious cache stealing some of it for duplicated buffers will only hurt. > There is a further out tipping point I've theorized about but not really > explored: the point where even the database indexes stop fitting in memory > usefully. As you get closer to that, I'd expect that the clock sweep algorithm > used by shared_buffers should make it a bit more likely that those important > blocks would hang around usefully if you put them there, rather than giving > most of the memory to the OS to manage. Hm, sounds logical. At that point the slow drop-off should become even shallower and possibly become completely flat. Greater shared_buffers might start helping again at that point. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Greg Smith wrote:
| On Mon, 8 Dec 2008, Merlin Moncure wrote:
|
|> I wonder if shared_buffers has any effect on how far you can go before
|> you hit the 'tipping point'.
|
| If your operating system has any reasonable caching itself, not so much at
| first. As long as the index on the account table fits in shared_buffers,
| even the basic sort of caching logic an OS uses is perfectly functional
| for swapping the individual pages of the account table in and out, the
| main limiting factor on pgbench performance.
|
| There is a further out tipping point I've theorized about but not really
| explored: the point where even the database indexes stop fitting in
| memory usefully. As you get closer to that, I'd expect that the clock
| sweep algorithm used by shared_buffers should make it a bit more likely
| that those important blocks would hang around usefully if you put them
| there, rather than giving most of the memory to the OS to manage.

I am by no means an expert at this. But one thing that can matter is whether you want to improve just the performance of the dbms, or the performance of the entire system on which the dbms runs. Because if you want to improve the whole system, you would want as much of the caching to take place in the system's buffers, so the use of the memory could be optimized over the entire workload, not just the load of the dbms itself. I suppose on a dedicated system with only one dbms running, with only one database open (at a time, anyway), this might be moot, but not otherwise.

Now I agree that it would be good to get the entire index (or at least the working set of the index) into the memory of the computer. But does it really matter if it is in the system's cache, or the postgres cache? Is it any more likely to be in postgres's cache than in the system cache if the system is hurting for memory? I would think the system would be equally likely to page out "idle" pages no matter where they are unless they are locked to memory, and I do not know if all operating systems can do this; even if they can, I do not know if postgres uses that ability. I doubt it, since I believe (at least in Linux) a process can do that only if run as root, which I imagine few (if any) users do.

--
~ .~. Jean-David Beyer Registered Linux User 85642.
~ /V\ PGP-Key: 9A2FC99A Registered Machine 241939.
~ /( )\ Shrewsbury, New Jersey http://counter.li.org
~ ^^-^^ 07:55:02 up 5 days, 18:13, 4 users, load average: 4.18, 4.17, 4.11
On Tuesday 09 December 2008 13:08:14 Jean-David Beyer wrote:
>
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run
> as root, which I imagine few (if any) users do.

Disclaimer: I'm not a system programmer...

I believe that at Linux kernel revision 2.6.8 and before, processes need the POSIX capability CAP_IPC_LOCK to lock memory at all; from 2.6.9 on they only need CAP_IPC_LOCK to lock more than RLIMIT_MEMLOCK. It is a capability, so a process can run as any user, assuming it is started with or has gained the capability.

No idea if Postgres uses any of this; other than protecting the security of certain password operations, there is probably not much point. If key parts of your database are being paged out, get more RAM; if idle parts of your database are paged out, you probably could more usefully apply that RAM to something else.

The Varnish cache design is the place to look for enlightenment on relying on the kernel paging (using memory-mapped files) rather than trying to do it yourself, but then a proxy server is a lot simpler than an RDBMS. That said, Varnish is fast at what it does (reverse HTTP proxy)!
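Purely as an illustration (a sketch, independent of whether Postgres actually uses any of this): a process can check its RLIMIT_MEMLOCK and try to pin a buffer with mlock(), and on 2.6.9+ kernels that succeeds without CAP_IPC_LOCK as long as the locked size stays under the limit.

    /* sketch: check RLIMIT_MEMLOCK and try to pin a small buffer in RAM */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int
    main(void)
    {
        struct rlimit rl;
        size_t want = 4 * 1024 * 1024;      /* try to pin 4 MB */
        void *buf;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        {
            perror("getrlimit");
            return 1;
        }
        printf("RLIMIT_MEMLOCK soft limit: %llu bytes\n",
               (unsigned long long) rl.rlim_cur);

        buf = malloc(want);
        if (buf == NULL)
            return 1;
        memset(buf, 0, want);               /* touch the pages first */

        if (mlock(buf, want) != 0)
            perror("mlock");                /* fails if over the limit without the capability */
        else
            printf("locked %zu bytes in RAM\n", want);

        munlock(buf, want);
        free(buf);
        return 0;
    }

If the requested size exceeds the limit, mlock() simply fails (typically ENOMEM), so nothing here needs root as long as the locked region stays small.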
On Sun, Dec 7, 2008 at 7:38 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
> Also, the following patches currently still have bugs, but when the bugs are
> fixed I'll be looking for performance testers, so please either watch the
> wiki or watch this space:
> ...
> -- posix_fadvise (Gregory Stark)

Eh? Quite possibly, but none that I'm aware of. The only problem is a couple of trivial bits of bitrot. I'll post an update now if you want.

--
greg
> ________________________________________
> From: pgsql-performance-owner@postgresql.org [pgsql-performance-owner@postgresql.org] On Behalf Of Jean-David Beyer [jeandavid8@verizon.net]
> Sent: Tuesday, December 09, 2008 5:08 AM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Need help with 8.4 Performance Testing
>
> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.

Yes, the OS is in a better position to arbitrate between multiple things. Of course, we aren't talking about the highest performance databases if we are talking about mixed use systems though. Additionally, the OS can never really get it right, with a DB or other apps. Any app can behave badly, grab too much RAM and access it regularly enough for it not to be 'idle' much, while giving the OS VM fits trying to figure out whether it is important or not versus other processes.

> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.

The problem is when none of them are really 'idle': when the OS has to decide which pages, all of which have been accessed recently, to evict. Most OS's will make bad choices if the load is mixed random and sequential access, as they treat all pages equally with respect to freshness versus eviction.

Another problem is that there IS a difference between being in postgres' cache and the OS cache. One is more expensive to retrieve than the other. Significantly. Accessing buffers in shared_buffers, in process, uses a good chunk less CPU (and data copy and shared buffer eviction overhead) than going over the syscall to the OS.

And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in block_size chunks. (Hopefully I am wrong.)

My system is now CPU bound. The I/O can do sequential reads of more than 1.2GB/sec, but Postgres can't do a seqscan 30% as fast because it eats up CPU like crazy just reading and identifying tuples. It does seqscans ~25% faster if it's from shared_buffers than from the OS's page cache, though. Seqscans are between 250MB/sec and 400MB/sec peak, from memory or disk, typically showing no more than 35% iostat utilization of the array if off disk -- so we run a few concurrently where we can.

In addition to the fadvise patch, postgres needs to merge adjacent I/O's into larger ones to reduce the overhead.
It only really needs to merge up to sizes of about 128k or 256k, and gain an 8x to 16x drop in syscall overhead, and additionally potentially save code trips down the shared buffer management code paths. At least, that's my guess; I haven't looked at any code and could be wrong (see the illustrative sketch below this message).

Additionally, the "If your operating system has any reasonable caching itself" comment earlier in this conversation -- Linux (2.6.18, CentOS 5.2) does NOT. I can easily make it spend 100% CPU in system time trying to figure out what to do with the system cache for an hour. Just do large seqscans with memory pressure from work_mem or other forces that the OS will not deem 'idle'. Once the requested memory is ~75% of the system total, it will freak out. Linux simply will not give up that last 25% or so of the RAM for anything but page cache, even though the disk subsystem is very fast and most of the access is sequential, marginalizing the benefit of the cache. Depending on how you tune it, it will either spin system CPU or swap storm, but the system CPU spin times for the same workload are a lot shorter than an equivalent swap storm. Mount the data drive in O_DIRECT and the problem vanishes. I've been told that this problem may be gone in some of the latest kernels. I have seriously considered bumping shared_buffers up a lot and mounting the thing direct -- but then we lose the useful scheduler and readahead algorithms. The other way around (small shared_buffers, let the OS do it) hurts performance overall quite a bit -- randomly accessed pages get pushed out to the OS cache more often, and the OS tosses those out when a big seqscan occurs, resulting in a lot more random access from disk and more disk bound periods of time. Great wonder, this operating system caching, eh?

In any event, don't hold up these OS page cache things as if they're the best thing in the world for a database; they have serious flaws themselves and typically are difficult or impossible to tune to be ideal for a database.

It's one thing to propose that a database build its own file system (hard, and why bother?) versus have a database manage its own page cache intelligently and access the OS file system as optimally as it can. In both of the latter, the DB knows much more about what data is really important than the OS (and could, for example, prioritize cache versus work_mem intelligently, while the OS can get that one horribly wrong in my experience, and knows when a huge seqscan occurs to make caching those results low priority). No matter how you do it using the OS cache, you cache twice and copy twice. O_DIRECT isn't usually an option for other reasons; the OS disk scheduler, readahead, and other benefits of a file system are real and substantial. If you are caching twice, you might as well have the "closer" copy of that data be the larger, more efficient pool.

As for tipping points and pgbench -- it doesn't seem to reflect the kind of workload we use postgres for at all, though my workload does a lot of big hashes and seqscans, and I'm curious how much improved those may be due to the hash improvements. 32GB RAM and 3TB data (about 250GB scanned regularly) here. And yes, we are almost completely CPU bound now except for a few tasks. Iostat only reports above 65% disk utilization for about 5% of the workload duty-cycle, and is regularly < 20%. COPY doesn't get anywhere near platter speeds on indexless bulk transfer. The highest disk usage spikes occur when some of our random-access data/indexes get shoved out of cache. These aren't too large, but high enough seqscan load will cause postgres and the OS to dump them from cache. If we put these on some SSD's the disk utilization % would drop a lot further.

I feel confident in saying that in about a year, I could spec out a medium sized hardware budget ($25k) for almost any postgres setup and make it almost pure CPU bound. SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it will take no more than 12 SATA drives in RAID 10 next year (and a good controller or software raid) to get 1GB/sec sequential reads.
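For illustration, a rough sketch (not Postgres code; the file, offsets, and sizes are assumptions) of the kind of merged read suggested above: sixteen adjacent 8 kB blocks fetched with a single 128 kB pread() after a posix_fadvise() hint, instead of sixteen block-sized reads.

    /* sketch only: one merged 128 kB read instead of sixteen 8 kB reads */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ   8192
    #define NBLOCKS  16                     /* 16 * 8 kB = 128 kB per syscall */

    int
    main(int argc, char **argv)
    {
        char   *buf = malloc((size_t) BLCKSZ * NBLOCKS);
        off_t   startblk = 0;               /* first block of the adjacent run */
        ssize_t n;
        int     fd;

        if (argc < 2 || buf == NULL)
            return 1;

        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        /* hint the kernel that this range will be needed (the 8.4 fadvise idea) */
        posix_fadvise(fd, startblk * BLCKSZ, BLCKSZ * NBLOCKS, POSIX_FADV_WILLNEED);

        /* one syscall for the whole run of adjacent blocks */
        n = pread(fd, buf, (size_t) BLCKSZ * NBLOCKS, startblk * BLCKSZ);
        printf("read %zd bytes in one syscall\n", n);

        close(fd);
        free(buf);
        return 0;
    }

Whether a merged read like this actually buys the 8x to 16x syscall-overhead drop guessed at above would have to be measured; the sketch only shows the shape of the change.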
On Tue, Dec 9, 2008 at 9:37 AM, Scott Carey <scott@richrelevance.com> wrote:
> As for tipping points and pg_bench -- It doesn't seem to reflect the kind of
> workload we use postgres for at all, though my workload does a lot of big
> hashes and seqscans, and I'm curious how much improved those may be due to
> the hash improvements. 32GB RAM and 3TB data (about 250GB scanned regularly)
> here. And yes, we are almost completely CPU bound now except for a few
> tasks. Iostat only reports above 65% disk utilization for about 5% of the
> workload duty-cycle, and is regularly < 20%. COPY doesn't get anywhere near
> platter speeds, on indexless bulk transfer. The highest disk usage spikes
> occur when some of our random-access data/indexes get shoved out of cache.
> These aren't too large, but high enough seqscan load will cause postgres and
> the OS to dump them from cache. If we put these on some SSD's the disk
> utilization % would drop a lot further.

It definitely reflects our usage pattern, which is very random and involves tiny bits of data scattered throughout the database. Our current database is about 20-25 Gig, which means it's quickly reaching the point where it will not fit in our 32G of ram, and it's likely to grow too big for 64Gig before a year or two is out.

> I feel confident in saying that in about a year, I could spec out a medium
> sized budget for hardware ($25k) for almost any postgres setup and make it
> almost pure CPU bound.
> SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access
> to 10k+ iops, and it will take no more than 12 SATA drives in raid 10 next
> year (and a good controller or software raid) to get 1GB/sec sequential
> reads.

Lucky you, having needs that are fulfilled by sequential reads. :)

I wonder how many hard drives it would take to be CPU bound on random access patterns? About 40 to 60? And probably 15k / SAS drives to boot. Cause that's what we're looking at in the next few years where I work.
On Tue, 2008-12-09 at 10:21 -0700, Scott Marlowe wrote: > On Tue, Dec 9, 2008 at 9:37 AM, Scott Carey <scott@richrelevance.com> wrote: > Lucky you, having needs that are fulfilled by sequential reads. :) > > I wonder how many hard drives it would take to be CPU bound on random > access patterns? About 40 to 60? And probably 15k / SAS drives to > boot. Cause that's what we're looking at in the next few years where > I work. I was able to achieve only 10-20% IO/Wait even after beating the heck out of the machine with 50 spindles (of course it does have 16 CPUs): http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/ > -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
> Lucky you, having needs that are fulfilled by sequential reads. :)
> I wonder how many hard drives it would take to be CPU bound on random
> access patterns? About 40 to 60? And probably 15k / SAS drives to
> boot. Cause that's what we're looking at in the next few years where
> I work.

About $3000 worth of Intel mainstream SSD's = 240GB space (6 in raid 10) today, and 2x to 3x that storage area in 1 year. Random reads are even easier, provided you don't need more than 500GB or so.

And with something like ZFS + L2ARC you can back your data with large, slow-iops disks and have cache access to data without requiring mirrors on the cache ($3k of SSDs for that covers 2x the area, then).
On Tue, 9 Dec 2008, Scott Marlowe wrote: > I wonder how many hard drives it would take to be CPU bound on random > access patterns? About 40 to 60? And probably 15k / SAS drives to > boot. Cause that's what we're looking at in the next few years where > I work. There's a problem with that thinking. That is, in order to exercise many spindles, you will need to have just as many (if not more) concurrent requests. And if you have many concurrent requests, then you can spread them over multiple CPUs. So it's more a case of "How many hard drives PER CPU". It also becomes a matter of whether Postgres can scale that well. Matthew -- Those who do not understand Unix are condemned to reinvent it, poorly. -- Henry Spencer
On Tue, Dec 9, 2008 at 10:35 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Tue, 9 Dec 2008, Scott Marlowe wrote:
>>
>> I wonder how many hard drives it would take to be CPU bound on random
>> access patterns? About 40 to 60? And probably 15k / SAS drives to
>> boot. Cause that's what we're looking at in the next few years where
>> I work.
>
> There's a problem with that thinking. That is, in order to exercise many
> spindles, you will need to have just as many (if not more) concurrent
> requests. And if you have many concurrent requests, then you can spread them
> over multiple CPUs. So it's more a case of "How many hard drives PER CPU".
> It also becomes a matter of whether Postgres can scale that well.

For us, all that is true. We typically have a dozen or more concurrent requests running at once, and we'll likely see that increase linearly with our increase in users over the next year or so. We bought the machines with dual quad core Opterons knowing the 6, 8 and 12 core Opterons were due out on the same socket design in the next year or so, so we could upgrade those too if needed.

PostgreSQL seems to scale well in most tests I've seen to at least 16 cores, and after that it's anyone's guess. The Sparc Niagara seems capable of scaling to 32 threads on 8 cores with pgsql 8.2 quite well. I worry about the Linux kernel scaling that well, and we might have to look at OpenSolaris or something like the Solaris kernel under an Ubuntu distro to get better scaling.
Let me re-phrase this.
For today, at 200GB or less of required space, and 500GB or less next year.
“Where we’re going, we don’t NEED spindles.”
Seriously, go down to the store and get 6 X25-M’s, they’re as cheap as $550 each and will be sub $500 soon. These are more than sufficient for all but heavy write workloads (each can withstand ~600+ TB of writes in a lifetime, and SMART will tell you before they go). 6 in a RAID 10 will give you 750MB/sec read, and equivalent MB/sec in random reads. I’ve tested them. Random writes are very very fast too, faster than any SAS drive.
Put this in your current system, and you won’t need to upgrade the RAM unless you need items in cache to reduce CPU load or need it for the work_mem space.
Spindles will soon be only for capacity and sequential access performance requirements. Solid state will be for IOPS, and I would argue that for most Postgres installations, already is (now that the Intel SSD drive, which does random writes and read/write concurrency well, has arrived — more such next gen drives are on the way).
On 12/9/08 9:28 AM, "Scott Carey" <scott@richrelevance.com> wrote:
> Lucky you, having needs that are fulfilled by sequential reads. :)
> I wonder how many hard drives it would take to be CPU bound on random
> access patterns? About 40 to 60? And probably 15k / SAS drives to
> boot. Cause that's what we're looking at in the next few years where
> I work.
> About $3000 worth of Intel --- mainstream SSD's = 240GB space (6 in raid 10) today, 2x to 3x that storage area in 1 year.
> Random reads are even easier, provided you don't need more than 500GB or so.
> And with something like ZFS + L2ARC you can back your data with large slow iops disks and have cache access to data without requiring mirrors on the cache ($3k of ssds for that covers 2x the area, then).
On Tue, Dec 9, 2008 at 11:01 AM, Scott Carey <scott@richrelevance.com> wrote:
> Let me re-phrase this.
>
> For today, at 200GB or less of required space, and 500GB or less next year.
>
> "Where we're going, we don't NEED spindles."

Those intel SSDs sound compelling. I've been waiting for SSDs to get competitive price and performance wise for a while, and when the intels came out and I read the first benchmarks I immediately began scheming. Sadly, that was right after we'd ordered our new 16 drive servers, and I didn't have time to try something new and hope it would work. Now that the servers are up and running, we'll probably look at adding the SSDs next summer before our high load period begins.
On Tue, 2008-12-09 at 11:08 -0700, Scott Marlowe wrote:
> On Tue, Dec 9, 2008 at 11:01 AM, Scott Carey <scott@richrelevance.com> wrote:
> > Let me re-phrase this.
> >
> > For today, at 200GB or less of required space, and 500GB or less next year.
> >
> > "Where we're going, we don't NEED spindles."
>
> Those intel SSDs sound compelling. I've been waiting for SSDs to get
> competitive price and performance wise for a while, and when the
> intels came out and I read the first benchmarks I immediately began
> scheming. Sadly, that was right after we'd ordered our new 16 drive
> servers, and I didn't have time to try something new and hope it would
> work. Now that the servers are up and running, we'll probably look at
> adding the SSDs next summer before our high load period begins.

The idea of SSDs is interesting. However, I think I will wait for all the other early adopters to figure out the problems before I start suggesting them to clients.

Hard drives work, they're cheap and fast. I can get 25 spindles, 15k, in a 3U with controller and battery backed cache for <$10k.

Joshua D. Drake

--
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > Hard drives work, their cheap and fast. I can get 25 spindles, 15k in a > 3U with controller and battery backed cache for <$10k. While I agree with your general sentiments about early adoption, etc (the intel ssd products are the first flash drives that appear to have real promise in the enterprise), the numbers tell a different story. A *single* X25-E will give similar sustained write IOPS as your tray for far less price and a much better worst case read latency. All this without the 25 sets of whizzing ball bearings, painful spin-up times, fanning, RAID controller firmware, and various other sundry technologies to make the whole thing work. The main issue that I see with flash SSD is if the promised wear lifetimes are believable in high load environments and the mechanism of failure (slowly degrade into read only) is accurate. So, at least in relative terms, 15k sas drives are not 'fast'. They are terribly, awfully, painfully slow. They are also not cheap in terms of $/IOPS. The end is near. merlin
On Tue, 2008-12-09 at 15:07 -0500, Merlin Moncure wrote: > On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > > Hard drives work, their cheap and fast. I can get 25 spindles, 15k in a > > 3U with controller and battery backed cache for <$10k. > > While I agree with your general sentiments about early adoption, etc > (the intel ssd products are the first flash drives that appear to have > real promise in the enterprise), the numbers tell a different story. Oh I have read about them and I am excited. I am just saying that there are plenty of people who can take advantage of the unknown without the worry of the pain that can cause. My client, can't. > > The main issue that I see with flash SSD is if the promised wear > lifetimes are believable in high load environments and the mechanism > of failure (slowly degrade into read only) is accurate. > Right. > So, at least in relative terms, 15k sas drives are not 'fast'. They > are terribly, awfully, painfully slow. They are also not cheap in > terms of $/IOPS. The end is near. > No doubt about it. I give it 24 months tops. Joshua D. Drake > merlin > -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
Which brings this back around to the point I care the most about:
I/O per second will diminish as the most common database performance limiting factor in Postgres 8.4’s lifetime, and become almost irrelevant in 8.5’s.
Becoming more CPU efficient will become very important, and for some, already is. The community needs to be proactive on this front.
This turns a lot of old assumptions on their head, from the database down through the OS and filesystem. We’re bound to run into many surprises due to this major shift in something that has had its performance characteristics taken for granted for decades.
> On 12/9/08 12:20 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote:
> > So, at least in relative terms, 15k sas drives are not 'fast'. They
> > are terribly, awfully, painfully slow. They are also not cheap in
> > terms of $/IOPS. The end is near.
> >
> No doubt about it. I give it 24 months tops.
>
> Joshua D. Drake
Scott Carey <scott@richrelevance.com> writes:
> Which brings this back around to the point I care the most about:
> I/O per second will diminish as the most common database performance limiting
> factor in Postgres 8.4's lifetime, and become almost irrelevant in 8.5's.
> Becoming more CPU efficient will become very important, and for some, already
> is. The community needs to be proactive on this front.
> This turns a lot of old assumptions on their head, from the database down
> through the OS and filesystem. We're bound to run into many surprises due to
> this major shift in something that has had its performance characteristics
> taken for granted for decades.

Hmm ... I wonder whether this means that the current work on parallelizing I/O (the posix_fadvise patch in particular) is a dead end. Because what that is basically going to do is expend more CPU to improve I/O efficiency. If you believe this thesis then that's not the road we want to go down.

regards, tom lane
> Hmm ... I wonder whether this means that the current work on > parallelizing I/O (the posix_fadvise patch in particular) is a dead > end. Because what that is basically going to do is expend more CPU > to improve I/O efficiency. If you believe this thesis then that's > not the road we want to go down. I don't believe the thesis. The gap between disk speeds and memory speeds may narrow over time, but I doubt it's likely to disappear altogether any time soon, and certainly not for all users. Besides which, I believe the CPU overhead of that patch is pretty darn small when the feature is not enabled. ...Robert
On Tue, 2008-12-09 at 17:38 -0500, Tom Lane wrote:
> Scott Carey <scott@richrelevance.com> writes:
> > Which brings this back around to the point I care the most about:
> > I/O per second will diminish as the most common database performance
> > limiting factor in Postgres 8.4's lifetime, and become almost irrelevant
> > in 8.5's.
> > Becoming more CPU efficient will become very important, and for some,
> > already is. The community needs to be proactive on this front.
> > This turns a lot of old assumptions on their head, from the database down
> > through the OS and filesystem. We're bound to run into many surprises due
> > to this major shift in something that has had its performance
> > characteristics taken for granted for decades.
>
> Hmm ... I wonder whether this means that the current work on
> parallelizing I/O (the posix_fadvise patch in particular) is a dead
> end. Because what that is basically going to do is expend more CPU
> to improve I/O efficiency. If you believe this thesis then that's
> not the road we want to go down.

The per CPU performance increase against the per I/O system increase line is going to be vastly different. Anything that reduces overall I/O is going to help (remember, you can never go too fast).

The idea that somehow I/O per second will diminish as the most common database performance factor is IMO a pipe dream. Even as good as SSDs are getting, they still have to go through the bus. Something CPUs are better at (especially those CPUs that connect to memory directly without the bus). In 5 years maybe; in the next two postgresql releases, not likely. Not to mention all of this is around the idea of a different class of hardware than 99% of our community will be running.

Sincerely,

Joshua D. Drake

--
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
On Tue, 9 Dec 2008, Robert Haas wrote: > I don't believe the thesis. The gap between disk speeds and memory > speeds may narrow over time, but I doubt it's likely to disappear > altogether any time soon, and certainly not for all users. I think the "not for all users" is the critical part. In 2 years, we may (or may not) start using SSD instead of traditional drives for new installs, but we certainly won't be throwing out our existing servers any time soon just because something (much) better is now available.
Tom Lane wrote:
> Hmm ... I wonder whether this means that the current work on
> parallelizing I/O (the posix_fadvise patch in particular) is a dead
> end. Because what that is basically going to do is expend more CPU
> to improve I/O efficiency. If you believe this thesis then that's
> not the road we want to go down.

What does the CPU/Memory/Bus performance road map look like?

Is the I/O performance of the storage device, whatever it may be, going to be on par with the above and cause this problem?

Once I/O performance numbers start jumping up, I think DBAs will be tempted to start leaving more and more data in the production database instead of moving it out of the production database, or to start consolidating databases onto fewer servers, again pushing more load onto the I/O.
Prefetch CPU cost should be rather low in the grand scheme of things, and does help performance even for very fast I/O. I would not expect a very large CPU use increase from that sort of patch in the grand scheme of things — there is a lot that is more expensive to do on a per block basis.
There are two ways to look at non-I/O bound performance:
* Aggregate performance across many concurrent activities — here you want the least CPU used possible per action, and the least collisions on locks or shared data structures. Using resources for as short of an interval as possible also helps a lot here.
* Single query performance, where you want to shorten the query time, perhaps at the cost of more average CPU. Here, something like the fadvise stuff helps — as would any thread parallelism. Perhaps less efficient in aggregate, but more efficient for a single query.
Overall CPU cost of accessing and reading data. If this comes from disk, the big gains will be along the whole chain: Driver to file system cache, file system cache to process, process specific tasks (cache eviction, placement, tracking), examining page tuples, locating tuples within pages, etc. Anything that currently occurs on a per-block basis that could be done in a larger batch or set of blocks may be a big gain. Another place that commonly consumes CPU in larger software projects is memory allocation if more advanced allocation techniques are not used. I have no idea what Postgres uses here however. I do know that commercial databases have extensive work in this area for performance, as well as reliability (harder to cause a leak, or easier to detect) and ease of use (don’t have to even bother to free in certain contexts).
On 12/9/08 2:58 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

> I don't believe the thesis. The gap between disk speeds and memory
> speeds may narrow over time, but I doubt it's likely to disappear
> altogether any time soon, and certainly not for all users.

Well, when select count(1) reads pages slower than my disk, its 16x + slower than my RAM. Until one can demonstrate that the system can even read pages in RAM faster than what disks will do next year, it doesn't matter much that RAM is faster. It does matter that RAM is faster for sorts, hashes, and other operations, but at the current time it does not for the raw pages themselves, from what I can measure.
I doubt it is much CPU, on or off. It will help with SSDs when optimizing a single query; it may not help much if a system has enough 'natural' parallelism from other concurrent queries. However, there is a clear CPU benefit to getting individual queries out of the way faster, and to occupying precious work_mem or other resources for a shorter time. Occupying resources for a shorter period always translates to some CPU savings on a machine running at its limit with high concurrency.
> Besides which, I believe the CPU overhead of that patch is pretty darn
> small when the feature is not enabled.
> ...Robert
Scott Carey <scott@richrelevance.com> writes: > And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in > block_size chunks. (hopefully I am wrong) >... > In addition to the fadvise patch, postgres needs to merge adjacent I/O's > into larger ones to reduce the overhead. It only really needs to merge up to > sizes of about 128k or 256k, and gain a 8x to 16x drop in syscall overhead, > and additionally potentially save code trips down the shared buffer > management code paths. At lest, thats my guess I haven't looked at any code > and could be wrong. There are a lot of assumptions here that I would be interested in seeing experiments to back up. FWIW when I was doing testing of posix_fadvise I did a *lot* of experiments though only with a couple systems. One had a 3-drive array and one with a 15-drive array, both running Linux. I sometimes could speed up the sequential scan by about 10% but not consistently. It was never more than about 15% shy of the highest throughput from dd. And incidentally the throughput from dd didn't seem to depend much at all on the blocksize. On your system does "dd bs=8k" and "dd bs=128k" really have an 8x performance difference? In short, at least from the evidence available, this all seems like it might be holdover beliefs from the olden days of sysadmining where syscalls were much slower and OS filesystem caches much dumber. I'm still interested in looking into it but I'll have to see actual vmstat or iostat output while it's happening, preferably some oprofile results too. And how many drives do you actually need to get into this situation. Also, what is the output of "vacuum verbose" on the table? > Additionally, the "If your operating system has any reasonable caching > itself" comment earlier in this conversation --- Linux (2.6.18, Centos 5.2) > does NOT. I can easily make it spend 100% CPU in system time trying to > figure out what to do with the system cache for an hour. Just do large > seqscans with memory pressure from work_mem or other forces that the OS will > not deem 'idle'. Once the requested memory is ~75% of the system total, it > will freak out. Linux simply will not give up that last 25% or so of the RAM > for anything but page cache This seems like just a misconfigured system. Linux and most Unixen definitely expect to have a substantial portion of RAM dedicated to disk cache. Keep in mind all your executable pages count towards this page cache too. You can adjust this to some extent with the "swappiness" variable in Linux -- but I doubt you'll be happy with the results regardless. > The other way around (small shared_buffers, let the OS do it) hurts > performance overall quite a bit -- randomly accessed pages get pushed out to > the OS cache more often, and the OS tosses thouse out when a big seqscan > occurs, resulting in a lot more random access from disk and more disk bound > periods of time. Great wonder, this operating system caching, eh? How do you observe this? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Matthew Wakeling <matthew@flymine.org> writes:
> On Tue, 9 Dec 2008, Scott Marlowe wrote:
>> I wonder how many hard drives it would take to be CPU bound on random
>> access patterns? About 40 to 60? And probably 15k / SAS drives to
>> boot. Cause that's what we're looking at in the next few years where
>> I work.
>
> There's a problem with that thinking. That is, in order to exercise many
> spindles, you will need to have just as many (if not more) concurrent requests.
> And if you have many concurrent requests, then you can spread them over
> multiple CPUs. So it's more a case of "How many hard drives PER CPU". It also
> becomes a matter of whether Postgres can scale that well.

Well:

$ units
2445 units, 71 prefixes, 33 nonlinear units
You have: 8192 byte/5ms
You want: MB/s
        * 1.6384
        / 0.61035156

At 1.6MB/s per drive, if you find Postgres is cpu-bound doing sequential scans at 1GB/s, you'll need about 640 drives to keep one cpu satisfied doing random I/O -- assuming you have perfect read-ahead and the read-ahead itself doesn't add cpu overhead. Both of which are false of course, but at least in theory that's what it'll take.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
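If you want to redo that back-of-the-envelope estimate with your own numbers, here is the same arithmetic as a tiny standalone program; the 8KB block, 5ms random read, and 1GB/s scan rate are just the assumptions from the calculation above, not measurements:

/* Rough estimate: how many drives doing random 8KB reads it takes to
 * feed one CPU that can chew through a sequential scan at 1 GB/s.
 * All inputs are assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double block_bytes  = 8192.0;    /* one PostgreSQL block */
    double seek_seconds = 0.005;     /* ~5 ms per random read */
    double scan_rate    = 1e9;       /* CPU-bound seqscan rate, bytes/sec */

    double per_drive = block_bytes / seek_seconds;   /* bytes/sec per drive */

    printf("per-drive random-read throughput: %.2f MB/s\n", per_drive / 1e6);
    printf("drives needed to sustain the scan: %.0f\n", scan_rate / per_drive);
    /* Prints roughly 1.64 MB/s and ~610 drives (or ~655 if you count
     * 1 GB as 2^30 bytes), the same ballpark as the figure above. */
    return 0;
}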
Tom Lane wrote: > Scott Carey <scott@richrelevance.com> writes: >> Which brings this back around to the point I care the most about: >> I/O per second will diminish as the most common database performance limiting factor in Postgres 8.4's lifetime, and become almost irrelevant in 8.5's. >> Becoming more CPU efficient will become very important, and for some, already is. The community needs to be proactive on this front. >> This turns a lot of old assumptions on their head, from the database down through the OS and filesystem. We're bound to run into many surprises due to this major shift in something that has had its performance characteristics taken for granted for decades. > > Hmm ... I wonder whether this means that the current work on > parallelizing I/O (the posix_fadvise patch in particular) is a dead > end. Because what that is basically going to do is expend more CPU > to improve I/O efficiency. If you believe this thesis then that's > not the road we want to go down. > I imagine the larger postgres installations will still benefit from this patch - because I imagine they will stay on hard disks for quite some time; simply because the cost of 70TB of disks seems like it'll be lower than RAM for at least the intermediate term. I imagine the smaller postgres installations will also still benefit from this patch - because my postgres installations with the most painful I/O bottlenecks are small virtual machines without dedicated drives where I/O (I guess emulated by the virtual machine software) is very painful. Perhaps there's a mid-sized system that won't benefit from fadvise() in the intermediate term -- where the size of the database is about the same size as a cost-effective flash drive -- but I don't have any databases in that range now.
justin wrote: > Tom Lane wrote: >> Hmm ... I wonder whether this means that the current work on >> parallelizing I/O (the posix_fadvise patch in particular) is a dead >> end. Because what that is basically going to do is expend more CPU >> to improve I/O efficiency. If you believe this thesis then that's >> not the road we want to go down. >> >> regards, tom lane >> > What does the CPU/ Memory/Bus performance road map look like? > > Is the IO performance for storage device for what ever it be, going to > be on par with the above to cause this problem? Flash memory will become just a fourth layer in the memory caching system (on-board CPU, high-speed secondary cache, main memory, and persistent memory). The idea of external storage will probably disappear altogether -- computers will just have memory, and won't forget anything when you turn them off. Since most computers are 64 bits these days, all data and programs will just hang out in memory at all times, and be directly addressable by the CPU. The distinction between disk and memory arose from the fact that disks were large, slow devices relative to "core" memory and had to be connected by long wires, hence the need for I/O subsystems. As flash memory becomes mainstream, I expect this distinction to disappear. Craig
> Well, when select count(1) reads pages slower than my disk, its 16x + slower > than my RAM. Until one can demonstrate that the system can even read pages > in RAM faster than what disks will do next year, it doesn't matter much that > RAM is faster. It does matter that RAM is faster for sorts, hashes, and > other operations, but at the current time it does not for the raw pages > themselves, from what I can measure. > > This is in fact, central to my point. Things will be CPU bound, not I/O > bound. It is mentioned that we still have to access things over the bus, > and memory is faster, etc. But Postgres is too CPU bound on page access to > take advantage of the fact that memory is faster (for reading data pages). As I understand it, a big part of the reason for the posix_fadvise patch is that the current system doesn't do a good job leveraging many spindles in the service of a single query. So the problem is not that the CPU overhead is too large in some general sense but that the disk and CPU operations get serialized, leading to an overall loss of performance. On the other hand, there are certainly cases (such as a database which is entirely in RAM, or all the commonly used parts are in RAM) where there really isn't very much I/O, and in those cases of course the CPU cost will dominate. ...Robert
Just to clarify, I'm not talking about random I/O bound loads today, on hard drives, targeted by the fadvise stuff -- these aren't CPU bound, and they will be helped by it.
For sequential scans, this situation is different, since the OS has sufficient read-ahead prefetching algorithms of its own for sequential reads, and the CPU work and I/O work end up happening in parallel because of that.
For what it is worth, you can roughly double to triple the iops of an Intel X-25M on pure random reads if you queue up multiple concurrent reads rather than serialize them. But it is not due to spindles, it is due to the latency of the SATA interface and the ability of the controller chip to issue reads to flash devices on different banks concurrently to some extent.
On 12/9/08 7:06 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> Well, when select count(1) reads pages slower than my disk, its 16x + slower
> than my RAM. Until one can demonstrate that the system can even read pages
> in RAM faster than what disks will do next year, it doesn't matter much that
> RAM is faster. It does matter that RAM is faster for sorts, hashes, and
> other operations, but at the current time it does not for the raw pages
> themselves, from what I can measure.
>
> This is in fact, central to my point. Things will be CPU bound, not I/O
> bound. It is mentioned that we still have to access things over the bus,
> and memory is faster, etc. But Postgres is too CPU bound on page access to
> take advantage of the fact that memory is faster (for reading data pages).
As I understand it, a big part of the reason for the posix_fadvise
patch is that the current system doesn't do a good job leveraging many
spindles in the service of a single query. So the problem is not that
the CPU overhead is too large in some general sense but that the disk
and CPU operations get serialized, leading to an overall loss of
performance. On the other hand, there are certainly cases (such as a
database which is entirely in RAM, or all the commonly used parts are
in RAM) where there really isn't very much I/O, and in those cases of
course the CPU cost will dominate.
...Robert
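For readers who have not looked at the patch itself, the general technique under discussion looks roughly like the following. This is only a minimal sketch of the posix_fadvise(POSIX_FADV_WILLNEED) prefetch pattern, not code from the actual patch; the file name, block list, and prefetch distance are made-up values:

/* Sketch of overlapping random reads with CPU work by advising the
 * kernel about upcoming blocks.  Illustrative only -- not patch code. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192
#define PREFETCH_DISTANCE 4     /* how many blocks of advice to keep ahead */

int main(void)
{
    int fd = open("/tmp/example.dat", O_RDONLY);    /* hypothetical data file */
    if (fd < 0) { perror("open"); return 1; }

    /* Pretend the executor wants these scattered blocks, e.g. from a
     * bitmap heap scan or an index scan. */
    long blocks[] = { 17, 4093, 882, 23111, 9, 5120, 40001, 77 };
    int  nblocks  = (int)(sizeof(blocks) / sizeof(blocks[0]));
    char buf[BLCKSZ];

    /* Prime the advice window so the kernel can start several reads at once. */
    for (int i = 0; i < nblocks && i < PREFETCH_DISTANCE; i++)
        posix_fadvise(fd, blocks[i] * (off_t) BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);

    for (int i = 0; i < nblocks; i++)
    {
        /* Keep the advice window PREFETCH_DISTANCE blocks ahead of the reads,
         * so the disk(s) work on future blocks while we process this one. */
        if (i + PREFETCH_DISTANCE < nblocks)
            posix_fadvise(fd, blocks[i + PREFETCH_DISTANCE] * (off_t) BLCKSZ,
                          BLCKSZ, POSIX_FADV_WILLNEED);

        if (pread(fd, buf, BLCKSZ, blocks[i] * (off_t) BLCKSZ) != BLCKSZ)
            perror("pread");

        /* ... per-block CPU work (tuple processing) would happen here ... */
    }

    close(fd);
    return 0;
}

The win comes from keeping several seeks in flight at once across many spindles (or one SSD with internal parallelism), instead of strictly alternating between one seek and one burst of CPU work.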
I did some further tests that alter some of my statements below.
Mainly:
* I can do select count(1) queries at closer to disk speeds than my older tests (on a different machine) indicated. I can get ~800MB/sec where the disks can do 1200MB/sec and other single process tasks can go 1100MB/sec. It is mostly CPU bound during these tests.
Accessing from within shared_buffers rather than from the OS cache is roughly 20% faster, but it is hard to measure consistently (hard for me to be certain it's in that cache).
Slightly more complicated scan queries (simple group by aggregates, returning < 1000 rows) slow down to ~1/3 disk speed, depending on various factors that are not worth getting into. So, postgres is less CPU bound currently than I thought.
* The linux caching 100% system CPU spin issue seems related to something pinning memory on my machine. Not sure what it is yet, the only thing running on the machine is postgres, and the size of the pinned memory is roughly equal to shared_buffers.
* I have some tests that conflict with prior results, that seem to depend on whether postgres is on or off when I run the disk benchmark. This may be related to the pinned memory above. Specifically, larger block size reads reduce overall CPU usage more if there isn’t anything else running on the system than if there is — attributable to more time spent in kswapd when postgres is on (and mostly idle). Further investigation is needed here.
On 12/9/08 8:37 AM, "Scott Carey" <scott@richrelevance.com> wrote:
> ________________________________________
> From: pgsql-performance-owner@postgresql.org [pgsql-performance-owner@postgresql.org] On Behalf Of > Jean-David Beyer [jeandavid8@verizon.net]
> Sent: Tuesday, December 09, 2008 5:08 AM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Need help with 8.4 Performance Testing
> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.
Yes, the OS is in a better position to arbitrate between multiple things. Of course, we aren't talking about the highest-performance databases if we are talking about mixed-use systems though.
Additionally, the OS can never really get it right, with a DB or other apps. Any app can behave badly and grab too much RAM and access it regularly enough for it to not be 'idle' much, giving the OS VM fits trying to figure out whether it's important or not versus other processes.
> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.
The problem, is when none of them are really 'idle'. When the OS has to decide which pages, all of which have been accessed recently, to evict. Most OS's will make bad choices if the load is mixed random and sequential access, as they treat all pages equally with respect to freshness versus eviction.
Another problem is that there IS a difference between being in postgres' cache and the OS cache. One is more expensive to retrieve than the other. Significantly.
Accessing buffers in shared_buffers, in process, uses a good chunk less CPU (and less data copying and shared buffer eviction overhead) than going through a syscall to the OS.
And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in block_size chunks. (hopefully I am wrong)
My system is now CPU bound; the I/O can do sequential reads of more than 1.2GB/sec but Postgres can't do a seqscan 30% as fast because it eats up CPU like crazy just reading and identifying tuples. It does seqscans ~25% faster if it's from shared_buffers than from the OS's page cache though. Seqscans are between 250MB/sec and 400MB/sec peak, from mem or disk, typically showing no more than 35% iostat utilization of the array if off disk -- so we run a few concurrently where we can.
In addition to the fadvise patch, postgres needs to merge adjacent I/Os into larger ones to reduce the overhead. It only really needs to merge up to sizes of about 128k or 256k to gain an 8x to 16x drop in syscall overhead, and additionally potentially save code trips down the shared buffer management code paths. At least, that's my guess; I haven't looked at any code and could be wrong.
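To make the syscall-count argument concrete, here is a rough sketch of what I mean by merging adjacent I/Os at the read() level. It is not PostgreSQL code and does not reflect how the buffer manager actually works; the file name and block counts are placeholders:

/* Sketch: reading 32 adjacent 8KB blocks with one syscall per block
 * versus a single 256KB pread for the same range.  Illustrative only. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ  8192
#define NBLOCKS 32                 /* 32 * 8KB = 256KB */

int main(void)
{
    int   fd = open("/tmp/example.dat", O_RDONLY);   /* hypothetical file */
    off_t start = 0;                                 /* some adjacent block range */
    static char buf[NBLOCKS * BLCKSZ];

    if (fd < 0) { perror("open"); return 1; }

    /* One syscall (and one trip through buffer management) per 8KB block:
     * 32 syscalls for this 256KB range. */
    for (int i = 0; i < NBLOCKS; i++)
        pread(fd, buf + i * BLCKSZ, BLCKSZ, start + (off_t) i * BLCKSZ);

    /* The merged version: the same 256KB in a single syscall. */
    pread(fd, buf, sizeof buf, start);

    close(fd);
    return 0;
}

Whether the difference shows up in practice is easy to check with dd at bs=8k versus bs=128k against the same device; on some systems the kernel's readahead already hides most of the per-syscall cost.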
Additionally, the "If your operating system has any reasonable caching itself" comment earlier in this conversation --- Linux (2.6.18, Centos 5.2) does NOT. I can easily make it spend 100% CPU in system time trying to figure out what to do with the system cache for an hour. Just do large seqscans with memory pressure from work_mem or other forces that the OS will not deem 'idle'. Once the requested memory is ~75% of the system total, it will freak out. Linux simply will not give up that last 25% or so of the RAM for anything but page cache, even though the disk subsustem is very fast and most of the access is sequential, marginalizing the benefit of the cache. Depending on how you tune it, it will either spin system cpu or swap storm, but the system cpu spin times for the same work load are a lot shorter than an equivalent swap storm.
Mount the data drive with O_DIRECT and the problem vanishes. I've been told that this problem may be gone in some of the latest kernels. I have seriously considered bumping shared_buffers up a lot and mounting the thing direct -- but then we lose the useful scheduler and readahead algorithms. The other way around (small shared_buffers, let the OS do it) hurts performance overall quite a bit -- randomly accessed pages get pushed out to the OS cache more often, and the OS tosses those out when a big seqscan occurs, resulting in a lot more random access from disk and more disk-bound periods of time. Great wonder, this operating system caching, eh?
In any event, don't hold up these OS page cache things as if they're the best thing in the world for a database, they have serious flaws themselves and typically are difficult or impossible to tune to be ideal for a database.
It's one thing to propose that a database build its own file system (hard, and why bother?) versus have a database manage its own page cache intelligently and access the OS file system as optimally as it can. In either case, the DB knows much more about what data is really important than the OS (and could, for example, prioritize cache versus work_mem intelligently, while the OS can get that one horribly wrong in my experience, and knows when a huge seqscan occurs to make caching those results low priority). No matter how you do it using the OS cache, you cache twice and copy twice. O_DIRECT isn't usually an option for other reasons; the OS disk scheduler, readahead, and other benefits of a file system are real and substantial. If you are caching twice, you might as well have the "closer" copy of that data be the larger, more efficient pool.
As for tipping points and pgbench -- it doesn't seem to reflect the kind of workload we use postgres for at all, though my workload does a lot of big hashes and seqscans, and I'm curious how much improved those may be due to the hash improvements. 32GB RAM and 3TB data (about 250GB scanned regularly) here. And yes, we are almost completely CPU bound now except for a few tasks. Iostat only reports above 65% disk utilization for about 5% of the workload duty-cycle, and is regularly < 20%. COPY doesn't get anywhere near platter speeds on indexless bulk transfer. The highest disk usage spikes occur when some of our random-access data/indexes get shoved out of cache. These aren't too large, but high enough seqscan load will cause postgres and the OS to dump them from cache. If we put these on some SSDs the disk utilization % would drop a lot further.
I feel confident in saying that in about a year, I could spec out a medium sized budget for hardware ($25k) for almost any postgres setup and make it almost pure CPU bound.
SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it will take no more than 12 SATA drives in raid 10 next year (and a good controller or software raid) to get 1GB/sec sequential reads.
On Tue, 9 Dec 2008, Scott Carey wrote: > For what it is worth, you can roughly double to triple the iops of an Intel X-25M on pure random reads if you queue up > multiple concurrent reads rather than serialize them. But it is not due to spindles, it is due to the latency of the > SATA interface and the ability of the controller chip to issue reads to flash devices on different banks concurrently > to some extent. Spindles, banks. What's the difference? The fact is that you have multiple "things", and queuing up requests in the controller means that more than one at a time can be active. The X-25M is basically a RAID controller in its own right, connected to ten flash devices. Matthew -- Don't worry! The world can't end today because it's already tomorrow in Australia.
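An easy way to see that queue-depth effect for yourself is to issue the same random 8KB reads from several threads at once and compare the aggregate rate against a single thread. A rough sketch (the file path, thread count, and read counts are placeholders; compile with -pthread and time it externally):

/* Sketch: N threads doing random 8KB preads against one file, to compare
 * aggregate random-read throughput at different queue depths. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192
#define READS_PER_THREAD 10000
#define MAX_THREADS 64

static const char *path;
static off_t nblocks;

static void *reader(void *arg)
{
    unsigned int seed = (unsigned int)(long) arg;
    int  fd = open(path, O_RDONLY);
    char buf[BLCKSZ];

    for (int i = 0; i < READS_PER_THREAD; i++)
    {
        off_t blk = rand_r(&seed) % nblocks;
        pread(fd, buf, BLCKSZ, blk * BLCKSZ);
    }
    close(fd);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <file> <nthreads>\n", argv[0]);
        return 1;
    }
    path = argv[1];
    int nthreads = atoi(argv[2]);
    if (nthreads < 1 || nthreads > MAX_THREADS)
        nthreads = 1;

    int fd = open(path, O_RDONLY);
    nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;
    close(fd);
    if (nblocks < 1) { fprintf(stderr, "file too small\n"); return 1; }

    pthread_t threads[MAX_THREADS];
    for (long i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, reader, (void *) i);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    /* Run under "time" and divide (nthreads * READS_PER_THREAD) by the
     * elapsed seconds to get IOPS; compare nthreads=1 against nthreads=8. */
    return 0;
}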
> I would expect higher shared_buffers to raise the curve before the first > breakpoint but after the first breakpoint make the drop steeper and deeper. > The equilibrium where the curve becomes flatter should be lower. On SpecJAppserver specifically, I remember seeing a drop when the database size grew beyond the size of shared_buffers. --Josh
Greg Stark wrote: > On Sun, Dec 7, 2008 at 7:38 PM, Josh Berkus <josh@agliodbs.com> wrote: >> Also, the following patches currently still have bugs, but when the bugs are >> fixed I'll be looking for performance testers, so please either watch the >> wiki or watch this space: >> ... >> -- posix_fadvise (Gregory Stark) > > Eh? Quite possibly but none that I'm aware of. The only problem is a > couple of trivial bits of bitrot. I'll a post an update now if you > want. I'm just going off the status on hackers archives. I didn't actually try to build it before posting that. If you have an updated patch, link on CommitFest page? Thanks. --Josh
Tom, > Hmm ... I wonder whether this means that the current work on > parallelizing I/O (the posix_fadvise patch in particular) is a dead > end. Because what that is basically going to do is expend more CPU > to improve I/O efficiency. If you believe this thesis then that's > not the road we want to go down. Nope. People who administer small databases keep forgetting that there is another class of users with multiple terabytes of data. Those users aren't getting away from spinning disk anytime in the next 5 years. Additionally, by making PostgreSQL work better with OS-based FS optimization, we are well positioned to take advantage of any special features which Linux, Solaris, BSD etc. add to utilize new hardware like SSDs. posix_fadvise is a great example of this. --Josh
Scott Marlowe wrote: > involves tiny bits of data scattered throughout the database. Our > current database is about 20-25 Gig, which means it's quickly reaching > the point where it will not fit in our 32G of ram, and it's likely to > grow too big for 64Gig before a year or two is out. > ... > I wonder how many hard drives it would take to be CPU bound on random > access patterns? About 40 to 60? And probably 15k / SAS drives to > Well, it's not a very big database and you're seek bound - so what's wrong with the latest generation flash drives? They're perfect for what you want to do it seems, and you can probably get what you need using the new ARC cache on flash stuff in ZFS.
On Tue, 9 Dec 2008, Scott Carey wrote: > My system is now CPU bound, the I/O can do sequential reads of more than > 1.2GB/sec but Postgres can't do a seqscan 30% as fast because it eats up > CPU like crazy just reading and identifying tuples... In addition to the > fadvise patch, postgres needs to merge adjacent I/O's into larger ones > to reduce the overhead. Do you have any profile data to back that up? I think it's more likely that bottlenecks are on the tuple processing side of things as you also suggested. There's really no sense guessing; one quick session with something like oprofile would be more informative than any amount of speculation on what's going on. > Additionally, the "If your operating system has any reasonable caching > itself" comment earlier in this conversation --- Linux (2.6.18, Centos > 5.2) does NOT. I can easily make it spend 100% CPU in system time > trying to figure out what to do with the system cache for an hour. Have you ever looked into how much memory ends up showing up as "Writeback" in /proc/meminfo when this happens? The biggest problem with that kernel out of the box on the kind of workload you're describing is that it will try and buffer way too much on the write side by default, which can easily get you into the sort of ugly situations you describe. I regularly adjust that kernel to lower dirty_ratio in particular dramatically from the default of 40 to keep that from happening. I did a whole blog entry on one of those if you're not familiar with this particular bit of painful defaults already: http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html > I feel confident in saying that in about a year, I could spec out a > medium sized budget for hardware ($25k) for almost any postgres setup > and make it almost pure CPU bound. The largest database I manage is running on a Sun X4500, which is right at that price point. I've never seen it not be CPU bound. Even though people are pulling data that's spread across a few TB of disk, the only time I ever see it straining to keep up with something there's always a single CPU pegged. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> Those intel SSDs sound compelling. I've been waiting for SSDs to get >> competitive price and performance wise for a while, and when the >> intels came out and I read the first benchmarks I immediately began >> scheming. Sadly, that was right after we're ordered our new 16 drive >> servers, and I didn't have time to try something new and hope it would >> work. Now that the servers are up and running, we'll probably look at >> adding the SSDs next summer before our high load period begins. > > The idea of SSDs is interesting. However I think I will wait for all the > other early adopters to figure out the problems before I start > suggesting them to clients. > > Hard drives work, their cheap and fast. I can get 25 spindles, 15k in a > 3U with controller and battery backed cache for <$10k. This may be a little off-topic, but I'd be interested in hearing more details about how you (or others) would do this... manufacturer, model, configuration? How many hard drives do you need to get 25 spindles? And where can you get that many 15K hard drives for under $10K? My lack of experience in this area is showing here, but, seriously, any suggestions appreciated. ...Robert
On Sat, 13 Dec 2008, Robert Haas wrote: > On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >>> Those intel SSDs sound compelling. I've been waiting for SSDs to get >>> competitive price and performance wise for a while, and when the >>> intels came out and I read the first benchmarks I immediately began >>> scheming. Sadly, that was right after we're ordered our new 16 drive >>> servers, and I didn't have time to try something new and hope it would >>> work. Now that the servers are up and running, we'll probably look at >>> adding the SSDs next summer before our high load period begins. >> >> The idea of SSDs is interesting. However I think I will wait for all the >> other early adopters to figure out the problems before I start >> suggesting them to clients. >> >> Hard drives work, their cheap and fast. I can get 25 spindles, 15k in a >> 3U with controller and battery backed cache for <$10k. > > This may be a little off-topic, but I'd be interested in hearing more > details about how you (or others) would do this... manufacturer, > model, configuration? How many hard drives do you need to get 25 > spindles? And where can you get that many 15K hard drives for under > $10K? My lack of experience in this area is showing here, but, > seriously, any suggestions appreciated. 1 spindle == 1 hard drive David Lang
On Sat, Dec 13, 2008 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >>> Those intel SSDs sound compelling. I've been waiting for SSDs to get >>> competitive price and performance wise for a while, and when the >>> intels came out and I read the first benchmarks I immediately began >>> scheming. Sadly, that was right after we're ordered our new 16 drive >>> servers, and I didn't have time to try something new and hope it would >>> work. Now that the servers are up and running, we'll probably look at >>> adding the SSDs next summer before our high load period begins. >> >> The idea of SSDs is interesting. However I think I will wait for all the >> other early adopters to figure out the problems before I start >> suggesting them to clients. >> >> Hard drives work, their cheap and fast. I can get 25 spindles, 15k in a >> 3U with controller and battery backed cache for <$10k. > > This may be a little off-topic, but I'd be interested in hearing more > details about how you (or others) would do this... manufacturer, > model, configuration? How many hard drives do you need to get 25 > spindles? And where can you get that many 15K hard drives for under > $10K? My lack of experience in this area is showing here, but, > seriously, any suggestions appreciated. I don't know about fitting in a 3U, but you can get a 24 15k 73G SAS drive machine with 8 gig ram and a single quad core CPU from aberdeeninc.com for $10,055.00.
On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote: > On Sat, 13 Dec 2008, Robert Haas wrote: > > This may be a little off-topic, but I'd be interested in hearing more > > details about how you (or others) would do this... manufacturer, > > model, configuration? How many hard drives do you need to get 25 > > spindles? And where can you get that many 15K hard drives for under > > $10K? My lack of experience in this area is showing here, but, > > seriously, any suggestions appreciated. > http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying dell gets its for ~ 10k. Joshua D. Drake -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote: >> On Sat, 13 Dec 2008, Robert Haas wrote: > >> > This may be a little off-topic, but I'd be interested in hearing more >> > details about how you (or others) would do this... manufacturer, >> > model, configuration? How many hard drives do you need to get 25 >> > spindles? And where can you get that many 15K hard drives for under >> > $10K? My lack of experience in this area is showing here, but, >> > seriously, any suggestions appreciated. >> > > http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix > > Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying > dell gets its for ~ 10k. I prefer to deal with companies that I don't have to horse trade with to get a good deal. You can threaten dell and get good deals, but if you're not watching, and let your boss handle an order suddenly you've got an $800 Perc 5e on your hands.
On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote: > On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote: > > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote: > >> On Sat, 13 Dec 2008, Robert Haas wrote: > > http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix > > > > Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying > > dell gets its for ~ 10k. > > I prefer to deal with companies that I don't have to horse trade with > to get a good deal. You can threaten dell and get good deals, but if And what company would that be? There is zero major server manufacturer that doesn't do the, "Oh you have a competitive bid... let's just lower that quote for you" Note: HP can beat Dell, every time on an apples to apples quote. At least when I have done it. Joshua D. Drake > -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Sat, Dec 13, 2008 at 12:47 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote: >> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote: >> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote: >> >> On Sat, 13 Dec 2008, Robert Haas wrote: > >> > http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix >> > >> > Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying >> > dell gets its for ~ 10k. >> >> I prefer to deal with companies that I don't have to horse trade with >> to get a good deal. You can threaten dell and get good deals, but if > > And what company would that be? There is zero major server manufacturer > that doesn't do the, "Oh you have a competitive bid... let's just lower > that quote for you" > > Note: HP can beat Dell, every time on an apples to apples quote. At > least when I have done it. Aberdeen inc, the one I listed in my previous response. Their on site price for a single quad core xeon, 8 gig 800MHZ ram and 24 15k5 SAS drives is $10,080 or so. I've never had to tell them I was getting a better price anywhere else. They just give me a great quote each time, they have very fast and efficient customer service, and they give a 5 year warranty on everything they custom build for you. I'm a very satisfied customer. BTW, they seem to have saved a fair bit of money and passed the savings along to you and me, by designing their website in one whole afternoon. :) Yeah, it's ugly, but it works well enough to get an idea what you want and then ask for a quote. We got our 8 core opteron 32 Gig ram areca 1680i and 16 15k4 Seagates for $11,000 or so at the end of summer. When we first plugged them in, one machine started hanging after 24-48 hours of steady load testing. CAlled them, they said it sounded like a RAID card and they'd ship me a new one next day business. Let me make it clear, I didn't tell them I thought it was the RAID card, or that I needed it the next day, they decided that's what it sounded like, and they shipped it to me, FOC. Arrived the next business day. Which was a monday. Between friday and monday I swapped the RAID controllers between the two machines, and after 36 hours of testing, the problem followed the RAID controller and the other machine locked up. Got the new controller, put it in and ran a week long test, no failures. We have since had one drive fail (each machine has two hot spares) and they cross shipped the replacement and we got it in 3 days. Few places have made me so happy with a decision to spend $22k on servers as these guys have.
On Sat, 2008-12-13 at 12:57 -0700, Scott Marlowe wrote: > On Sat, Dec 13, 2008 at 12:47 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > > On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote: > >> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote: > >> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote: > >> >> On Sat, 13 Dec 2008, Robert Haas wrote: > > Note: HP can beat Dell, every time on an apples to apples quote. At > > least when I have done it. > > Aberdeen inc, the one I listed in my previous response. Sorry didn't see it. > Their on site > price for a single quad core xeon, 8 gig 800MHZ ram and 24 15k5 SAS > drives is $10,080 or so. I've never had to tell them I was getting a > better price anywhere else. They just give me a great quote each > time, they have very fast and efficient customer service, and they > give a 5 year warranty on everything they custom build for you. I'm a > very satisfied customer. Well that would work for CMD but CMDs customer's would say, "Who the heck is Aberdeen?" > Few places have made me so happy with a decision to spend $22k on > servers as these guys have. > Well that is definitely a solid testimony. I don't have any experience with them but I know that the smaller companies always provide better service so I have no reason to doubt you. The larger the company gets the harder it is to get through the muck of bureaucracy. I have to be honest though, unless the customer explicitly states we don't have a problem with white box, its going to be HP, DELL or IBM, occasionally SUN but once they see how ridiculous the initial quotes from Sun are they generally don't want to deal with them anymore. Sincerely, Joshua D. Drake -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
On Saturday, 13 December 2008, Scott Marlowe wrote: > On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote: > > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote: > >> On Sat, 13 Dec 2008, Robert Haas wrote: > > > >> > This may be a little off-topic, but I'd be interested in hearing more > >> > details about how you (or others) would do this... manufacturer, > >> > model, configuration? How many hard drives do you need to get 25 > >> > spindles? And where can you get that many 15K hard drives for under > >> > $10K? My lack of experience in this area is showing here, but, > >> > seriously, any suggestions appreciated. > >> > > > > http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix > > > > Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying > > dell gets its for ~ 10k. > > I prefer to deal with companies that I don't have to horse trade with > to get a good deal. You can threaten dell and get good deals, but if > you're not watching, and let your boss handle an order suddenly you've > got an $800 Perc 5e on your hands. Are Perc 5s bad cards? Is that because they are SATA? And what about the Perc 4? Those are SCSI. Thanks in advance -- Don't print this email if it isn't necessary; you'll need the money for the paper to pay for Mr. Gore's talks. ->>----------------------------------------------- Clist UAH a.k.a Angel ---------------------------------[www.uah.es]-<<-- "Politicians and diapers have to be changed every so often."
On Sat, Dec 13, 2008 at 1:03 PM, Joshua D. Drake <jd@commandprompt.com> wrote: > On Sat, 2008-12-13 at 12:57 -0700, Scott Marlowe wrote: >> On Sat, Dec 13, 2008 at 12:47 PM, Joshua D. Drake <jd@commandprompt.com> wrote: >> > On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote: >> >> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote: >> >> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote: >> >> >> On Sat, 13 Dec 2008, Robert Haas wrote: > >> > Note: HP can beat Dell, every time on an apples to apples quote. At >> > least when I have done it. >> >> Aberdeen inc, the one I listed in my previous response. > > Sorry didn't see it. > >> Their on site >> price for a single quad core xeon, 8 gig 800MHZ ram and 24 15k5 SAS >> drives is $10,080 or so. I've never had to tell them I was getting a >> better price anywhere else. They just give me a great quote each >> time, they have very fast and efficient customer service, and they >> give a 5 year warranty on everything they custom build for you. I'm a >> very satisfied customer. > > Well that would work for CMD but CMDs customer's would say, "Who the > heck is Aberdeen?" For me, it's the company I recommend. Here's the point I make. If there's any problem, whatsoever, with aberdeen equipment when I'm putting it in, I'll come fix it and interface. If you pick someone else, YOU get to interface and replace parts, I'm not doing it. Their time versus my time. The same is true with anything else I that I recommend. >> Few places have made me so happy with a decision to spend $22k on >> servers as these guys have. > > Well that is definitely a solid testimony. I don't have any experience > with them but I know that the smaller companies always provide better > service so I have no reason to doubt you. The larger the company gets > the harder it is to get through the muck of bureaucracy. > > I have to be honest though, unless the customer explicitly states we > don't have a problem with white box, its going to be HP, DELL or IBM, > occasionally SUN but once they see how ridiculous the initial quotes > from Sun are they generally don't want to deal with them anymore. I've made it quite clear to bosses in the past that if they stick me with dud hardware (i.e. dell 2650s with adaptec RAID controllers, et. al.) that I will not be handling the long, drudging fight with the vendor. I'll just sell it on ebay if it won't work and we'll order what I said the first time. One or two times of that happening and they don't second guess me much anymore. My current boss is much more progressive, and while he had his doubts, he too is now sold. Isn't it amazing how many small businesses won't buy from other small businesses? They'd much rather give their money to a company they don't like because "they'll be around a while" (the big company).
On Sat, 2008-12-13 at 19:16 -0700, Scott Marlowe wrote: > Isn't it amazing how many small businesses won't buy from other small > businesses? They'd much rather give their money to a company they > don't like because "they'll be around a while" (the big company). > True enough! Joshua D. Drake -- PostgreSQL Consulting, Development, Support, Training 503-667-4564 - http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > Isn't it amazing how many small businesses won't buy from other small > businesses? It's entertaining that Dell is considered one of the "safe choices" in this thread. They were a pretty small business not so long ago (and remain a lot smaller than IBM or HP) ... regards, tom lane