Thread: choosing the right platform
Hello all,

I've been the lead developer of a successful (so far) web application. Currently we run the database and the web application on the same server, but it will soon be necessary to split them and add some more web servers. I'm not counting on using replication any time soon, so I want to choose the hardware platform that will meet my needs for some time. Maybe you can give some suggestions...

My application is written in PHP and relies on the Apache web server. We use persistent db connections, so it appears (correct me if I'm wrong) that every Apache child process gets one connection to the database. If my MaxClients is 150, each web server will be able to make up to 150 db connections. I'd like to play with that number a little bit once I get the web server off of the db server. I feel that I could handle a greater number of clients, so let us say that I have up to 200 connections per server.

I'd like to have room to grow, so let's also say that I go to 20 web servers for a total of 4,000 connections. (I'd probably like to go even higher, so consider this our starting point.)

With a couple dozen active accounts and a lot of test data, my current database is equivalent to about 100 active accounts. Its current disk space consumption is:

    data # du --max-depth=2
    3656    ./base/1
    3656    ./base/16975
    4292    ./base/95378
    177824  ./base/200371
    189432  ./base
    144     ./global
    82024   ./pg_xlog
    2192    ./pg_clog
    273836  .

This is not optimized and there is a lot of old data, but to be safe, maybe we should assume that each account uses 4 MB of disk space in the db, counting indexes, tables, etc. I'd like to scale to 15,000 - 25,000 accounts, but I doubt that will be feasible at my budget level. (Also, there is a lot of optimizing to do, so it won't surprise me if this 4 MB number is more like 2 MB or even less.)

I'm not as concerned with the disk subsystem or layout at the moment. I've seen a lot of good documentation (especially from Bruce Momjian, thanks!) on this subject. I'm mostly concerned with choosing the platform that's going to allow the scalability I need.

Currently I'm most experienced in Linux, especially Red Hat. I'm "certified" on SCO OpenServer (5.x) and I've played with Irix, OSF/1 (I don't think it's called that anymore), FreeBSD (3.x) and Solaris (2.x). I'm most comfortable with Linux, but I'm willing to use a different platform if it will be beneficial. I've heard that Solaris on the SPARC platform is capable of addressing larger amounts of RAM than Linux on Intel does. I don't know if that's true or if it has any bearing, but I'd like to hear your opinions.

My budget is going to be between (US) $5,000 and $10,000 and I'd like to stay under $7,000. I'm a major bargain hunter, so I shop eBay a lot, and here are some samplings that I think may be relevant for discussion:

SUN (I'm not an expert in this, advice is requested)
----------------------------------------------------
Sun Enterprise 4500 - 8 x 400 MHz 4MB cache CPUs, 8GB RAM, no hard drives, ~$6,000
Sun E3500 - 8 x 336 MHz 4MB cache CPUs, 4GB RAM, 8 x 9.1GB FC disks, ~$600.00
Any other suggestions?

INTEL (I'm much more familiar with this area)
----------------------------------------------------
Compaq DL580 - 4 x 700 MHz 2MB cache CPUs, 4GB RAM (16GB max), HW RAID w/ 64MB cache, ~$6,000
IBM Netfinity 7100 - 4 x 500 MHz 1MB cache CPUs, up to 16GB max, HW RAID
Dell PowerEdge 8450 - 8 x 550 MHz 2MB cache CPUs, 4GB RAM (32GB max), HS RAID w/ 16MB cache, ~$4,500
Any other suggestions?

Any other hardware platforms I should consider?
Finally, and I know this sounds silly, but I don't have my own data center, so size is something I need to take into consideration. I pay for data center space by the physical size of my servers. My priorities are performance, a reasonable amount of scalability (as outlined above), and finally physical size.

Thanks for taking the time to read this and for any assistance you can give,

Matthew Nuzum
www.bearfruit.org
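A rough back-of-envelope reading of the figures above (this is just the numbers from the post multiplied out, not a measurement of anything):

    20 web servers x 200 Apache children  = 4,000 potential persistent db connections
    25,000 accounts x 4 MB per account    = ~100 GB of tables and indexes
    25,000 accounts x 2 MB per account    = ~50 GB if the optimization pans out
    current cluster (du output above)     = ~270 MB total, of which ~80 MB is pg_xlog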
Matthew,

> Currently I'm most experienced in Linux, especially Red Hat. I'm
> "certified" on SCO OpenServer (5.x) and I've played with Irix, OSF/1 (I
> don't think it's called that anymore), FreeBSD (3.x) and Solaris (2.x).
> I'm most comfortable with Linux, but I'm willing to use a different
> platform if it will be beneficial. I've heard that Solaris on the SPARC
> platform is capable of addressing larger amounts of RAM than Linux on
> Intel does. I don't know if that's true or if it has any bearing, but
> I'd like to hear your opinions.

Please browse through the list archives. We have numerous posts on the platform subject. In fact, several of us are trying to put together a PostgreSQL performance test package to answer this question definitively rather than anecdotally.

Anecdotal responses are:

Solaris is *bad* for PostgreSQL, due to a high per-process overhead. The universal opinion on this list has been that Solaris is optimized for multi-threaded applications, not multi-process applications, and Postgres is the latter.

*BSD has a *lot* of fans on the PGSQL lists, many of whom claim significantly better performance than Linux, mostly due to better filesystem I/O.

Linux is used heavily by a lot of PostgreSQL users. I have yet to see anyone provide actual Linux vs. BSD statistics, though ... something we hope to do.

Nobody has come forward and reported on PostgreSQL on SCO Unix.

Irix is widely regarded as a "dead" platform, though PostgreSQL does run on it ...

Good luck, and keep watching this space!

--
Josh Berkus
Aglio Database Solutions
San Francisco
On Tue, 8 Apr 2003, Matthew Nuzum wrote:
> We use persistent db connections, so it appears (correct me if I'm
> wrong) that every apache child process gets one connection to the
> database. If my MaxClients is 150, each web server will be able to make
> up to 150 db connections. [...]

I would say up front that both Linux and BSD are probably your two best choices. If you're familiar with one more than the other, that familiarity may be more important than the underlying differences in the two OSes, as they are both good platforms to run postgresql on top of.

Secondly, look carefully at using persistent connections in large numbers. While persistent connections DO represent a big savings in connect time, the savings are lost in the noise of many PHP applications. I.e., my dual PIII swiss army knife server can initiate persistent connections at 1,000,000 a second (reusing the old ones, of course), while non-persistent connects happen at 1,000 times a second. Most of my scripts run in 1/10th of a second or so, so the 1/1000th of a second used to connect is noise to me.

If you are going to use persistent connections, it might work better to let Apache have only 20 or 40 children, which will force the Apache children to "round robin" serve the requests coming in. This will usually work fine, since keeping the number of Apache children down keeps the number of postgresql backends down, which keeps the system faster in terms of response time. Turn keep-alive down to something short like 10 seconds, or just turn it off, as keep-alive doesn't really save all that much time in Apache.

Note that machine testing with 100 simultaneous connections doesn't translate directly to 100 users. Generally, x simultaneous connections usually represents about 10 to 20x users, since users don't click buttons all that fast. So an Apache configured with 40 max children should handle 100 to 200 users with no problem.
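For concreteness, a minimal sketch of the Apache side of that advice (these are standard Apache 1.3/2.0 prefork directives; the values are only illustrative, not tuned for Matthew's load):

    # httpd.conf -- keep the child pool small so the number of persistent
    # PostgreSQL backends (one per child when using pg_pconnect) stays small
    MaxClients        40

    # shorten keep-alive so idle browsers don't pin children, or turn it off
    KeepAlive         On
    KeepAliveTimeout  10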
> Anecdotal responses are:
>
> Solaris is *bad* for PostgreSQL, due to a high per-process overhead.
> Universal opinion on this list has been that Solaris is optimized for
> multi-threaded applications, not multi-process applications, and
> postgres is the latter.
>
> *BSD has a *lot* of fans on the PGSQL lists, many of whom claim
> significantly better performance than Linux, mostly due to better
> filesystem I/O.
>
> Linux is used heavily by a lot of PostgreSQL users. I have yet to see
> anyone provide actual Linux vs. BSD statistics, though ... something we
> hope to do.

Thanks for the reply. Three things come to mind:

About the list archives... I read through the entire archive at http://archives.postgresql.org/pgsql-performance/ and didn't see much talk on the subject. It only goes back 8 months though, so I don't know if there is another archive that is more comprehensive...

Also, I'm glad to hear your comments about Solaris; I'm really most comfortable with Linux and I think I can pick up BSD pretty easily.

About the Intel platform, though: it's only been pretty recently (relatively speaking) that servers based on the IA32 architecture have had support for more than 2GB of RAM. I've heard talk about problems with applications that require more than 2GB. I do believe that my tables will become larger than this, and the way I understand it, sort_mem works best when the tables can be loaded completely in RAM. I don't suspect that individual tables will be 2GB, but that the size of all tables combined will be. If there is a limitation on the largest chunk of RAM allocated to a program, will I have problems?

Finally, can someone suggest a *BSD to evaluate? FreeBSD 4.8? 5.0? Is Apple a good choice? (I've heard it's based on BSD Unix.)

Thanks,

--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org
> Finally, can someone suggest a *BSD to evaluate? FreeBSD 4.8? 5.0? Is
> Apple a good choice? (I've heard it's based on BSD Unix)

I wouldn't recommend OS X for deployment if you're worried about performance. The hardware available and the settings to take advantage of it just aren't there yet, compared to the more established FreeBSD and Linux offerings.

Development on a TiBook is another matter; I'd recommend it to anyone with an attraction to shiny things that do real work.

eric
On Wed, 9 Apr 2003, Matthew Nuzum wrote:
> I'm glad to hear your comments about Solaris, I'm really most
> comfortable with Linux and I think I can pick up BSD pretty easily.
>
> About the Intel platform though,
>
> It's only been pretty recently (relatively speaking) that servers based
> on IA32 architecture have had support for greater than 2GB of RAM.
> [...]
> I don't suspect that individual tables will be 2GB, but that the size
> of all tables combined will be. If there is a limitation on the largest
> chunk of RAM allocated to a program, will I have problems?

A couple more suggestions. One is to never allocate more than 50% of your memory to a database's shared buffers, i.e. let the OS buffer the disks en masse, while the database should have a smaller buffer for the most recent accesses. This is because kernel caching is usually faster and more efficient than the database doing it, and this becomes more of an issue with large chunks of memory, which both Linux and BSD are quite good at caching, and postgresql, not so good.

The other is to look at Linux or BSD on 64-bit hardware (SPARC, IBM zSeries mainframes, SGI Altix, etc.), where the one thing that's worth being on the bleeding edge for is databases and their memory-hungry ways. :-)
Matthew,

> I read through the entire archive at
> http://archives.postgresql.org/pgsql-performance/ and didn't see much
> talk on the subject. It only goes back 8 months though, so I don't know
> if there is another archive that is more comprehensive...

Really? There was a long-running Mac OS X vs. Solaris thread that touched on most major platforms, about 2-3 months ago.

> I don't suspect that individual tables will be 2GB, but that the size
> of all tables combined will be. If there is a limitation on the largest
> chunk of RAM allocated to a program, will I have problems?

No. Since PostgreSQL is a multi-process architecture, not multi-threaded, you only need enough RAM per process to load the current largest query. Plus, in my experience, disk I/O issues are vastly more important than RAM in database performance. You're better off spending money on really fast disks in Linux RAID or really good hardware RAID 1+0 ....

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
On Wed, Apr 09, 2003 at 01:13:46PM -0400, Matthew Nuzum wrote:
> Finally, can someone suggest a *BSD to evaluate? FreeBSD 4.8? 5.0? Is
> Apple a good choice? (I've heard it's based on BSD Unix)

FreeBSD has 3 different branches:

-current: This is bleeding edge. You definitely need to be careful with this one, and it's not recommended for production.

-stable: This is still a 'live' branch that any FreeBSD developer can (generally) commit to, but they are far more careful about breaking this branch. It's not as stable as a release branch, but it's probably suitable for production so long as you're careful to test things.

release branches: Every time an official release is done (i.e. 4.8), a branch is created. The only code committed to these branches is security patches and fixes for very serious bugs. These branches are extremely stable.

5.0 is the first release after several years of development in -current. It incorporates some major changes designed to allow the kernel to run multi-threaded. However, unlike what usually happens, 5.0 is not considered to be -stable yet. First, this is still very new code; second, I believe there are some performance issues that are still being addressed. The intention is that 5.1 will be the first -stable release of the 5.x code.

Because you're looking for something that's production ready, you probably want 4.8 (cvs tag RELENG_4_8). However, if you don't plan to hit production until late this year (when 5.1 should be out), you might want to try 5.0.

Far more info is available at http://www.freebsd.org/releng/index.html

BTW, I've heard of many, many companies moving their Oracle installs from Sun to RS/6000 because RS/6000s typically need half the processors that Sun does for a given load. If you're going to look at big iron, RS/6000 is definitely worth a look if you see anything.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Wed, Apr 09, 2003 at 11:55:56AM -0600, scott.marlowe wrote:
> A couple more suggestions. One is to never allocate more than 50% of
> your memory to a database's shared buffers, i.e. let the OS buffer the
> disks en masse, while the database should have a smaller buffer for the
> most recent accesses. This is because kernel caching is usually faster
> and more efficient than the database doing it, and this becomes more of
> an issue with large chunks of memory, which both Linux and BSD are
> quite good at caching, and postgresql, not so good.

That seems odd... shouldn't pgsql be able to cache information better, since it would be cached in whatever format is best for it, rather than the raw page format (or maybe that is the best format)? There's also the issue of having to go through more layers of software if you're relying on the OS caching.

All the tuning info I've seen for every other database I've worked with specifically recommends giving the database as much memory as you possibly can, the theory being that it will do a much better job of caching than the OS will.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net
On Wed, Apr 09, 2003 at 10:51:34AM -0600, scott.marlowe wrote:
> Secondly, look carefully at using persistent connections in large
> numbers.
>
> While persistent connections DO represent a big savings in connect
> time, the savings are lost in the noise of many PHP applications.
>
> i.e. my dual PIII swiss army knife server can initiate persistent
> connections at 1,000,000 a second (reusing the old ones of course).
> non-persistent connects happen at 1,000 times a second. Most of my
> scripts run in 1/10th of a second or so, so the 1/1000th used to
> connect is noise to me.

My $0.02 from my experience with Sybase and DB2: it's not the connection *time* that's an issue, it's the amount of resources (mostly memory) used by each database connection. Each DB2 connection to a database uses 4-8 megs of memory; on my pgsql system, each connection appears to be using about 4M. This is the resident set, which I believe indicates memory that basically can't be shared. All this memory is memory that can't be used for buffering/caching; on a system with a hundred connections, it can really start to add up.

If your PHP is written in such a way that it does all the database work in one section of code, and only holds a connection to the database in that one section, then you can potentially have a lot of Apache processes for each database connection. Of course, all this holds true whether you're using pooling or not. How much pooling will help depends on how expensive it is for the *database* to handle each new connection request, and how your code is written.

Since it's often not possible to put all your database code in one place like I mentioned above, an alternative is to connect right before you do an operation, and disconnect as soon as you're done. This doesn't add much (if any) expense if you're using pooling, but it's a very different story if you're not using pooling.

> If you are going to use persistent connections, it might work better to
> let apache have only 20 or 40 children, which will force the apache
> children to "round robin" serve the requests coming in.
>
> This will usually work fine, since keeping the number of apache
> children down keeps the number of postgresql backends down, which keeps
> the system faster in terms of response time. Turn keep alive down to
> something short like 10 seconds, or just turn it off, as keep alive
> doesn't really save all that much time in apache.

Very important advice. Generally, once you push a database past a certain point, your performance degrades severely as the database thrashes about trying to answer all the pending queries.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net
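A minimal sketch of the "connect only around the work" pattern described above, using PHP 4-era pg_* calls (the connection string, table name, and helper function are made up for illustration):

    <?php
    // Hypothetical helper: hold a backend only for the section of the
    // page that actually needs the database.
    function fetch_account_count()
    {
        // pg_connect() starts a fresh backend each time; swapping in
        // pg_pconnect() would instead reuse one connection per Apache
        // child (the persistent behavior discussed in this thread).
        $conn = pg_connect("host=dbhost dbname=myapp user=webuser");
        if (!$conn) {
            return false;
        }

        $result = pg_query($conn, "SELECT count(*) FROM accounts");
        $row = pg_fetch_row($result);

        // Release the backend before the rest of the page (templating,
        // image handling, etc.) runs.
        pg_close($conn);

        return $row[0];
    }
    ?>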
"Jim C. Nasby" <jim@nasby.net> writes: > That seems odd... shouldn't pgsql be able to cache information better > since it would be cached in whatever format is best for it, rather than > the raw page format (or maybe that is the best format). There's also the > issue of having to go through more layers of software if you're relying > on the OS caching. All the tuning info I've seen for every other > database I've worked with specifically recommends giving the database as > much memory as you possibly can, the theory being that it will do a much > better job of caching than the OS will. There are a number of reasons why that's a dubious policy for PG (I won't take a position on whether these apply to other databases...) One is that because we sit on top of the OS' filesystem, we can't (portably) prevent the OS from caching blocks. So it's quite easy to get into a situation where the same data is cached twice, once in PG buffers and once in kernel disk cache. That's clearly a waste of RAM however you slice it, and it's worst when you set the PG shared buffer size to be about half of available RAM. You can minimize the duplication by skewing the allocation one way or the other: either set PG's allocation relatively small, relying heavily on the OS to do the caching; or make PG's allocation most of RAM and hope to squeeze out the OS' cache. There are partisans for both approaches on this list. I lean towards the first policy because I think that starving the kernel for RAM is a bad idea. (Especially if you run on Linux, where this policy tempts the kernel to start kill -9'ing random processes ...) Another reason is that PG uses a simplistic fixed-number-of-buffers internal cache, and therefore it can't adapt on-the-fly to varying memory pressure, whereas the kernel can and will give up disk cache space to make room when it's needed for processes. Since PG isn't even aware of the total memory pressure on the system as a whole, it couldn't do as good a job of trading off cache vs process workspace as the kernel can do, even if we had a variable-size cache scheme. A third reason is that on many (most?) Unixen, SysV shared memory is subject to swapping, and the bigger you make the shared_buffer arena, the more likely it gets that some of the arena will be touched seldom enough to make it a candidate for swapping. A disk buffer that gets swapped to disk is worse than useless (if it's dirty, the swapping is downright counterproductive, since an extra read and write cycle will be needed before the data can make it to its rightful place). PG is *not* any smarter about the usage patterns of its disk buffers than the kernel is; it uses a simple LRU algorithm that is surely no brighter than what the kernel uses. (We have looked at smarter buffer recycling rules, but failed to see any performance improvement.) So the notion that PG can do a better job of cache management than the kernel is really illusory. About the only advantage you gain from having data directly in PG buffers rather than kernel buffers is saving the CPU effort needed to move data across the userspace boundary --- which is not zero, but it's sure a lot less than the time spent for actual I/O. So my take on it is that you want shared_buffers fairly small, and let the kernel do the bulk of the heavy lifting for disk cache. That's what it does for a living, so let it do what it does best. You only want shared_buffers big enough so you don't spend too many CPU cycles shoving data back and forth between PG buffers and kernel disk cache. 
The default shared_buffers setting of 64 is surely too small :-(, but my feeling is that values in the low thousands are enough to get past the knee of that curve in most cases. regards, tom lane
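To make that concrete, a postgresql.conf sketch along the lines Tom describes, assuming a hypothetical machine with a few GB of RAM (the exact values are illustrative; units are those of the 7.3-era configuration):

    shared_buffers = 4096          # 8 KB pages each, i.e. ~32 MB (the "low
                                   # thousands" mentioned above)
    sort_mem = 8192                # KB per sort, per backend; multiply by the
                                   # expected concurrent sorts before raising it
    effective_cache_size = 262144  # 8 KB pages (~2 GB): a planner hint saying
                                   # how much the kernel is expected to cache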
Thanks for all the feedback, this is very informative.

The current issues that I'm still not clear on are:

* Is the IA32 architecture going to impose uncomfortable limits on my application? I'm seeing lots of confirmation that this platform, regardless of the OS, is going to limit me to less than 4GB of memory allocated to a single application (e.g. http://www.spack.org/index.cgi/LinuxRamLimits). This may or may not be an issue because: (note that these are questions, not statements)
** Postgres is multi-process, not multi-threaded (?)
** It's better not to use a huge amount of sort_mem but instead let the OS do the caching (?)
** My needs are really not going to be as big as I think they are if I manage the application/environment correctly (?)

Here are some of the performance suggestions I've heard; please, if I misunderstood, could you help me get clarity?

* It's better to run fewer Apache children and turn off persistent connections (I had suggested 200 children per server, someone else suggested 40).
* FreeBSD is going to provide a better filesystem than Linux (because Linux only supports large files on journaling filesystems, which impose extra overhead). (This was gleaned from this conversation and previous threads in the archives.)
* Running Linux or *BSD on a 64-bit platform can alleviate some potential RAM limitations (if there are truly going to be limitations). If this is so, I've heard suggestions for Itanium, SPARC and RS/6000. Maybe someone can give some more info on these; here are my immediate thoughts: I've heard that the industry as a whole has not yet warmed up to Itanium. I can't afford the newest SPARC servers, so I'd need to settle for a previous generation if I went that route; any problems with that? I know nothing about the RS/6000 servers (I did see one once though :-); does Linux or *BSD run well on them, and any suggestions for series/models I should look at?

Finally, some specific questions: What's the max number of connections someone has seen on a database server? What type of hardware was it? How much RAM did Postgres use?

Thanks again,

--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org
Tom,

What happens when PG scans a table that is too big to fit in the cache? Won't the whole cache get trashed and swapped off to disk? Shouldn't there be a way to lock some tables in PG's cache? How about characterizing some of the RAM for different uses, like: scans, indexes, small frequently used tables?

JLL

Tom Lane wrote:
> [...]
> PG is *not* any smarter about the usage patterns of its disk buffers
> than the kernel is; it uses a simple LRU algorithm that is surely no
> brighter than what the kernel uses. (We have looked at smarter buffer
> recycling rules, but failed to see any performance improvement.)
> [...]
Jean-Luc Lachance <jllachan@nsd.ca> writes:
> Shouldn't there be a way to lock some tables in PG cache?

In my opinion, no. I do not think a manual locking feature could possibly be used effectively. It could very easily be abused to decrease net performance, though :-(

It does seem that a smarter buffer management algorithm would be a good idea, but past experiments have failed to show any measurable benefit. Perhaps those experiments were testing the wrong conditions. I'd still be happy to see LRU(k) or some such method put in, if someone can prove that it actually does anything useful for us. (As best I recall, I only tested LRU-2 with pgbench. Perhaps Josh's benchmarking project will offer a wider variety of interesting scenarios.)

			regards, tom lane
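For anyone who wants to try that sort of measurement themselves, pgbench (shipped in PostgreSQL's contrib/ directory) can be run roughly like this; the database name and the scale/client numbers are arbitrary examples:

    # build a test database at scale factor 10 (about 1 million "accounts" rows)
    createdb cache_test
    pgbench -i -s 10 cache_test

    # 10 concurrent clients, 1000 transactions each; reports transactions/sec
    pgbench -c 10 -t 1000 cache_test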
On Wed, 9 Apr 2003, Jim C. Nasby wrote:
> My $0.02 from my experience with Sybase and DB2:
> It's not the connection *time* that's an issue, it's the amount of
> resources (mostly memory) used by each database connection. Each db2
> connection to a database uses 4-8 meg of memory;

Agreed.

> on my pgsql system, each connection appears to be using about 4M. This
> is the resident set, which I believe indicates memory that basically
> can't be shared. All this memory is memory that can't be used for
> buffering/caching; on a system with a hundred connections, it can
> really start to add up.

If I run "select * from logs" from two different psql sessions on my backup box hitting my main box (psql would hold the result set and throw the results off if I ran it on the main box), I get this output from top:

No (pgsql) load:

  8:58am  up 9 days, 22:43,  4 users,  load average: 0.65, 0.54, 0.35
 169 processes: 168 sleeping, 1 running, 0 zombie, 0 stopped
 CPU0 states:  0.1% user,  0.1% system,  0.0% nice,  99.1% idle
 CPU1 states: 32.1% user,  3.2% system,  0.0% nice,  64.0% idle
 Mem:  1543980K av, 1049864K used,  494116K free, 265928K shrd, 31404K buff
 Swap: 2048208K av,       0K used, 2048208K free,             568600K cached

   PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 10241 postgres   9  0  4216 4216  4136 S     0.0  0.2  0:05 postmaster
 10242 postgres   9  0  4444 4444  4156 S     0.0  0.2  0:00 postmaster
 10243 postgres   9  0  4812 4812  4148 S     0.0  0.3  0:00 postmaster

1 psql select *:

  9:03am  up 9 days, 22:48,  2 users,  load average: 0.71, 0.71, 0.46
 166 processes: 165 sleeping, 1 running, 0 zombie, 0 stopped
 CPU0 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
 CPU1 states:  0.1% user,  2.0% system,  0.0% nice,  97.3% idle
 Mem:  1543980K av, 1052188K used,  491792K free, 265928K shrd, 32036K buff
 Swap: 2048208K av,       0K used, 2048208K free,             570656K cached

   PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 10241 postgres  10  0  4216 4216  4136 S     0.0  0.2  0:05 postmaster
 10242 postgres   9  0  4448 4448  4156 S     0.0  0.2  0:00 postmaster
 10243 postgres   9  0  4812 4812  4148 S     0.0  0.3  0:00 postmaster
 18026 postgres   9  0  236M 236M  235M S     0.0 15.6  0:12 postmaster
 18035 postgres  10  0  5832 5732  5096 S     0.0  0.3  0:00 postmaster

2 psql select *:

  9:03am  up 9 days, 22:49,  2 users,  load average: 0.58, 0.66, 0.45
 166 processes: 165 sleeping, 1 running, 0 zombie, 0 stopped
 CPU0 states:  0.0% user,  2.2% system,  0.0% nice,  97.2% idle
 CPU1 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
 Mem:  1543980K av, 1053152K used,  490828K free, 265928K shrd, 32112K buff
 Swap: 2048208K av,       0K used, 2048208K free,             570684K cached

   PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 10241 postgres   8  0  4216 4216  4136 S     0.0  0.2  0:05 postmaster
 10242 postgres   9  0  4448 4448  4156 S     0.0  0.2  0:00 postmaster
 10243 postgres   9  0  4812 4812  4148 S     0.0  0.3  0:00 postmaster
 18026 postgres   9  0  236M 236M  235M S     0.0 15.6  0:12 postmaster
 18035 postgres   9  0  236M 236M  235M S     0.0 15.6  0:12 postmaster

The difference between SIZE and SHARE is the delta, which is only something like 3 or 4 megs for the initial "select * from logs", but for the second one it is only 1 meg. On average, the actual increase in memory usage for postgresql isn't that great, usually about 1 meg.

Running out of memory isn't really a problem with connections <= 200 and 1 gig of RAM, as long as sort_mem isn't too high. I/O contention is the killer at that point, as is CPU load.
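If you want to eyeball the same numbers on your own box, the figures above come straight from a stock procps top; something like the following, run on the database host, captures one snapshot of every backend, and the SIZE minus SHARE difference is roughly the private per-connection cost:

    top -b -n 1 | grep postmaster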
On Wed, 9 Apr 2003, Jim C. Nasby wrote:
> That seems odd... shouldn't pgsql be able to cache information better
> since it would be cached in whatever format is best for it, rather than
> the raw page format (or maybe that is the best format).

Yes and no. The problem isn't that the data is closer to postgresql in its buffers versus further away in kernel buffers, it's that postgresql's caching algorithm isn't performance-tweaked for very large settings; it's performance-tweaked to provide good performance on smaller machines, with say 4 or 16 megs of shared buffers. Handling large buffers requires a different approach than handling small ones, and the kernel is optimized in that direction.

Also, the kernel in most OSes, i.e. Linux and BSD, tends to use "spare RAM" with abandon as cache memory, so if you've got 4 gigs of RAM, with 200 megs set aside for postgresql, it's quite likely that the kernel cache can hold ALL your dataset for you once it's been read in once. So the data is already cached once. Caching it again in postgresql only gains a little, since the speed difference between postgresql's shared buffer cache and the kernel cache is very small. However, the speed going to the hard drive is much slower.

What you don't want is a postgresql cache that's bigger (on average) than the kernel cache, since the kernel cache will then be "thrashing" when you access information not currently in either cache. I.e. postgresql becomes your only cache, and kernel caching stops working for you and becomes just overhead, since you never get anything from it if it's too small to hold something long enough to be used again.

> There's also the issue of having to go through more layers of software
> if you're relying on the OS caching. All the tuning info I've seen for
> every other database I've worked with specifically recommends giving
> the database as much memory as you possibly can, the theory being that
> it will do a much better job of caching than the OS will.

That's old-school thinking. There was a day when kernel caching was much slower, and writing directly to your devices in raw mode was the only way to ensure good performance. Nowadays, most modern Unix kernels and their file systems are a match for most database needs. Heck, with some storage systems, the performance of the file system is just not really an issue; it's the bandwidth of the connector you use.

Note that this is a good thing (TM), since it frees the postgresql development team to do other things than worry about caching 1 gig of data.
"scott.marlowe" <scott.marlowe@ihs.com> writes: > Note that this is a good thing (TM) since it frees the postgresql > development team to do other things than worry about caching 1 gig of > data. Yeah. I think this is one very fundamental difference of design philosophy between Postgres and more-traditional databases such as Oracle. We prefer to let the kernel and filesystem do their jobs, and we assume they will do them well; whereas Oracle wants to bypass if not replace the kernel and filesystem. Partly this is a matter of the PG project not having the manpower to replace those layers. But I believe the world has changed in the last twenty years, and the Oracle approach is now obsolete: it's now costing them design and maintenance effort that isn't providing much return. Modern kernels and filesystems are *good*, and it's not easy to do better. We should focus our efforts on functionality that doesn't just duplicate what the OS can do. This design approach shows up in other areas too. For instance, in another thread I was just pointing out that there is no need for our frontend/backend protocol to solve communication problems like dropped or duplicated packets; TCP does that perfectly well already. regards, tom lane
How can we solve the problem of cache trashing when scanning large tables?

Tom Lane wrote:
> It does seem that a smarter buffer management algorithm would be a good
> idea, but past experiments have failed to show any measurable benefit.
> Perhaps those experiments were testing the wrong conditions. I'd still
> be happy to see LRU(k) or some such method put in, if someone can prove
> that it actually does anything useful for us.
On Thu, Apr 10, 2003 at 10:42:35AM -0600, scott.marlowe wrote:
> The difference between SIZE and SHARE is the delta, which is only
> something like 3 or 4 megs for the initial select * from logs, but for
> the second one it is only 1 meg. On average, the actual increase in
> memory usage for postgresql isn't that great, usually about 1 meg.
>
> Running out of memory isn't really a problem with connections <= 200
> and 1 gig of RAM, as long as sort_mem isn't too high. I/O contention
> is the killer at that point, as is CPU load.

Except you should consider what you could be doing with that 200M, i.e. caching data. Even something as small as 1M per connection starts to add up.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net