Thread: hardware performance and some more
Hello. Some of my questions may not be related to this group; however, I know that some of them are directly related to this list.

First of all, I would like to ask: do any of you use PostgreSQL in a clustered environment? Or, to put the question differently, can we use PostgreSQL in a cluster environment? If we can, what clustering methods does PostgreSQL support? I would like to ask about two main clustering methods (let us assume we use 2 machines in the clustering system). In the first case we have two machines running in a cluster, but the second one does not run the database server until a failure of the first machine is observed; the Oracle people call this an active-passive configuration. Only one machine runs the database server at any given time, so in the case of failure there is some waiting time until the second machine comes up. In the second option both machines run the database server at the same time; Oracle supports this method using an additional product called Real Application Clusters (RAC), and calls it an active-active configuration.

The questions for this explanation are:
1 - Can we use PostgreSQL within a clustered environment?
2 - If the answer is yes, in which mode can we use PostgreSQL within a cluster: active-passive or active-active?

Now, the second question is related to the performance of the database. Assume we have a Dell PowerEdge 6650 with 4 x 2.8 GHz Xeon processors, each having 2 MB of cache, and a main memory of, let's say, 32 GB. We can either use a small SAN from EMC or put all disks into the machine with the required RAID configuration. We will install Red Hat Advanced Server 2.1 as the operating system and PostgreSQL as the database server. We have a database of 25 million records with an average length of 250 bytes per record, and there are 1000 operators accessing the database concurrently. The main operation on the database (about 95%) is SELECT rather than INSERT. Do you have any idea about the performance of the system?

best regards,

-Kasım
On 24 Jul 2003 at 15:54, Kasim Oztoprak wrote:

> The questions for this explanation are:
> 1 - Can we use PostgreSQL within a clustered environment?
> 2 - If the answer is yes, in which mode can we use PostgreSQL within a
> cluster: active-passive or active-active?

Coupled with the Linux-HA heartbeat service (see http://linux-ha.org), it *should* be possible to run PostgreSQL in active-passive clustering.

If PostgreSQL supported read-only databases, so that several nodes could read off a single disk but only one could update it, a sort of active-active would be possible as well. But PostgreSQL cannot have a read-only database. That would be a handy addition for such cases..

> Now, the second question is related to the performance of the database.
> Assume we have a Dell PowerEdge 6650 with 4 x 2.8 GHz Xeon processors,
> each having 2 MB of cache, and a main memory of, let's say, 32 GB. We
> can either use a small SAN from EMC or put all disks into the machine
> with the required RAID configuration.
>
> We will install Red Hat Advanced Server 2.1 as the operating system and
> PostgreSQL as the database server. We have a database of 25 million
> records with an average length of 250 bytes per record, and there are
> 1000 operators accessing the database concurrently. The main operation
> on the database (about 95%) is SELECT rather than INSERT. Do you have
> any idea about the performance of the system?

Assuming 325 bytes per tuple (250-byte field + 24-28 byte header + varchar overhead) gives 25 tuples per 8K page, there would be 8 GB of data. This configuration could fly with 12-16 GB of RAM -- after all the data has been read, that is. You can cut down on the other requirements as well. Maybe a 2x Opteron with 16 GB of RAM might be a better fit, but check how much CPU cache it has.

A grep -rwn across the data directory would fill the disk cache pretty well.. :-)

HTH

Bye
Shridhar

--
Egotism, n: Doing the New York Times crossword puzzle with a pen. Egotist, n: A person of low taste, more interested in himself than me. -- Ambrose Bierce, "The Devil's Dictionary"
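Spelled out, that estimate works out as follows. This is a rough check only, assuming the default 8 KB page size and ignoring index and free-space overhead; it can be run as a quick SELECT:

    -- Back-of-the-envelope check of the 8 GB estimate:
    SELECT 8192 / 325                   AS tuples_per_page,  -- ~25
           25000000 / (8192 / 325)      AS pages_needed,     -- ~1,000,000
           25000000 / (8192 / 325) * 8  AS total_kb;         -- ~8,000,000 KB ~= 8 GB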
On 24 Jul 2003 17:08 EEST you wrote:

> On 24 Jul 2003 at 15:54, Kasim Oztoprak wrote:
>
> > The questions for this explanation are:
> > 1 - Can we use PostgreSQL within a clustered environment?
> > 2 - If the answer is yes, in which mode can we use PostgreSQL within a
> > cluster: active-passive or active-active?
>
> Coupled with the Linux-HA heartbeat service (see http://linux-ha.org),
> it *should* be possible to run PostgreSQL in active-passive clustering.
>
> If PostgreSQL supported read-only databases, so that several nodes could
> read off a single disk but only one could update it, a sort of
> active-active would be possible as well. But PostgreSQL cannot have a
> read-only database. That would be a handy addition for such cases..

So in a master-and-slave configuration we can use the system within a clustering environment.

> Assuming 325 bytes per tuple (250-byte field + 24-28 byte header +
> varchar overhead) gives 25 tuples per 8K page, there would be 8 GB of
> data. This configuration could fly with 12-16 GB of RAM -- after all the
> data has been read, that is. You can cut down on the other requirements
> as well. Maybe a 2x Opteron with 16 GB of RAM might be a better fit, but
> check how much CPU cache it has.

We do not have a memory problem or disk problems. As I have seen on the list, the best way to use the disks is RAID 10 for data and RAID 1 for the OS. We can put in as much memory as we require.

Now the question: if we have 100 searches per second, and each search needs 30 SQL statements, what will the performance of the system be in terms of response time? Let us say we have two machines as described above in a cluster.

> A grep -rwn across the data directory would fill the disk cache pretty
> well.. :-)
>
> HTH
>
> Bye
> Shridhar
> Now, the second question is related to the performance of the database.
> Assume we have a Dell PowerEdge 6650 with 4 x 2.8 GHz Xeon processors,
> each having 2 MB of cache, and a main memory of, let's say, 32 GB. We
> can either use a small SAN from EMC or put all disks into the machine
> with the required RAID configuration.
>
> We will install Red Hat Advanced Server 2.1 as the operating system and
> PostgreSQL as the database server. We have a database of 25 million
> records with an average length of 250 bytes per record, and there are
> 1000 operators accessing the database concurrently. The main operation
> on the database (about 95%) is SELECT rather than INSERT. Do you have
> any idea about the performance of the system?

I have a very similar installation: a Dell PE6600 with dual 2.0 GHz Xeons/2MB cache, 4 GB memory, a 6-disk RAID-10 for data, and a 2-disk RAID-1 for RH Linux 8. My database has over 60 million records averaging 200 bytes per tuple. I have a large nightly data load, then very complex multi-table join queries all day with a few INSERT transactions. While I do not have 1000 concurrent users (more like 30 for me), my processors and disks seem to be idle the vast majority of the time - this machine is overkill. So I think you will have no problem with your hardware, and could probably easily get away with only two processors. Someday, if you can determine with certainty that the CPU is a bottleneck, drop in the 3rd and 4th processors (and $10,000). And save yourself money on the RAM as well - it's incredibly easy to put in more if you need it. If you really want to spend money, set up the fastest disk arrays you can imagine.

I cannot emphasize enough: allocate a big chunk of time for tuning your database and learning from this list. I migrated from Microsoft SQL Server. Out of the box, PostgreSQL was horrible for me, and even after significant tuning it crawled on certain queries (compared to MSSQL). The list helped me find a data type mismatch in a JOIN clause, and since then the performance of PostgreSQL has blown the doors off of MSSQL. Since I only gave myself a couple of days to do tuning before the db had to go into production, I almost had to abandon PostgreSQL and revert to MS. My problems were solved in the nick of time, but I really wish I had made more time for tuning.

Running strong in production for 7 months now with PostgreSQL 7.3, and eagerly awaiting 7.4!

Roman Fail
POS Portal, Inc.
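The datatype-mismatch trap Roman mentions shows up in joins and in literal comparisons alike. A common 7.x-era example, sketched with hypothetical table and column names:

    -- Suppose subscriber_id is a BIGINT (int8) column with a btree index.
    -- In 7.x the bare literal is taken as int4, the int8 = int4 comparison
    -- does not match the index, and the planner falls back to a
    -- sequential scan:
    SELECT * FROM subscribers WHERE subscriber_id = 5551234;

    -- Casting (or quoting) the literal to the column's type restores the
    -- index scan:
    SELECT * FROM subscribers WHERE subscriber_id = 5551234::bigint;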
On 24 Jul 2003 18:44 EEST you wrote:

> I have a very similar installation: a Dell PE6600 with dual 2.0 GHz
> Xeons/2MB cache, 4 GB memory, a 6-disk RAID-10 for data, and a 2-disk
> RAID-1 for RH Linux 8. My database has over 60 million records averaging
> 200 bytes per tuple. I have a large nightly data load, then very complex
> multi-table join queries all day with a few INSERT transactions. While I
> do not have 1000 concurrent users (more like 30 for me), my processors
> and disks seem to be idle the vast majority of the time - this machine
> is overkill. So I think you will have no problem with your hardware, and
> could probably easily get away with only two processors.

I have some time before going to production; therefore, I can wait for the beta and the release of version 7.4. As I have seen from your comments, you have 30 clients reaching the database. Assuming a maximum of 5 searches per client, that is at most 3 searches per second. In my case there will be around 100 searches per second, so that is where the main bottleneck comes from. And finally, the rate of insert operations is about 0.1% (1 in every thousand). I started looking into my limitations a few days ago; I would like to find out whether I can solve my problem with PostgreSQL or not.

> I cannot emphasize enough: allocate a big chunk of time for tuning your
> database and learning from this list. [...] My problems were solved in
> the nick of time, but I really wish I had made more time for tuning.
>
> Running strong in production for 7 months now with PostgreSQL 7.3, and
> eagerly awaiting 7.4!
>
> Roman Fail
> POS Portal, Inc.
| First of all, I would like to ask: do any of you use PostgreSQL in a
| clustered environment? Or, to put the question differently, can we use
| PostgreSQL in a cluster environment? If we can, what clustering methods
| does PostgreSQL support?

You could do active-active, but it would require work on your end. I did a recent check on all the Postgres replication packages and they all seem to be single master -> single/many slaves. Updating on more than 1 server looks to be problematic. I run an active-active setup now, but I had to develop my own custom replication strategy.

As background, we develop & host web-based apps that use Postgres as the DB engine. Since our clients access our servers over the internet, uptime is a big issue. Hence, we have two server farms: one colocated in San Francisco and the other in Sterling, VA. In addition to redundancy, we also wanted to spread the load across the servers. To do this, we went with the expedient method of 1-minute DNS zonemaps: if both servers are up, 70% of traffic is sent to the faster farm and 30% to the other. Both servers are constantly monitored, and if one goes down, a new zonemap is pushed out listing only the servers that are up.

The first step in making this work was converting all integer keys to character keys. By making keys into characters, we could prepend a server location code, so ID 100 generated at SF would not conflict with ID 100 generated in Sterling. Instead, they would be marked as S00000100 and V00000100. Another benefit is the increase in possible key combinations from being able to use alpha characters (36^(n-1) versus 10^n).

At this time, the method we use is a periodic sweep of all updated records. In every table, we add extra fields to mark the date/time a record was last inserted/updated/deleted. All records touched since the last resync are extracted, zipped up, PGP-encrypted and then posted on an FTP server. Files are then transferred between servers, and records are unpacked and inserted/updated. Some checks are needed to determine what takes precedence if users updated the same record on both servers, but otherwise it's a straightforward process.

As far as I can tell, the performance impact seems to be minimal. There's a periodic storm of replication updates in cases where there are mass updates since the last resync. But if you have mostly reads and few writes, you shouldn't see this situation. The biggest performance impact seems to be the CPU power needed to zip/unzip/encrypt/decrypt files.

I'm thinking over strategies to get more "real-time" replication working. I suppose I could just make the resync program run more often, but that's a bit inelegant. Perhaps I could capture every update/delete/insert/alter statement from the postgres logs, parse them out to commands, and then zip/encrypt every command as a separate item to be processed. Or add triggers to every table where updated records are pushed to a custom "updated log". The biggest problem is of course locks -- especially at the application level. I'm still thinking over what to do here.
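A minimal sketch of the two techniques described above - location-prefixed character keys and a timestamp-based sweep. The table, sequence, and column names are hypothetical, and 'S' stands in for the per-server location code:

    -- Location-prefixed character keys: 'S' + a zero-padded sequence
    -- value, so keys generated at different sites can never collide.
    CREATE SEQUENCE order_id_seq;
    SELECT 'S' || lpad(nextval('order_id_seq')::text, 8, '0') AS new_key;

    -- Each replicated table carries a last-touched timestamp...
    CREATE TABLE orders (
        order_id   varchar(9) PRIMARY KEY,
        payload    text,
        updated_at timestamp NOT NULL DEFAULT now()
    );

    -- ...and the periodic sweep extracts everything touched since the
    -- last successful resync, to be zipped, encrypted, and shipped.
    -- (Deletes would have to be flagged rather than physically removed
    -- so the sweep can see them; omitted here.)
    SELECT * FROM orders WHERE updated_at > '2003-07-24 00:00:00';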
On Thu, 2003-07-24 at 13:25, Kasim Oztoprak wrote:
> On 24 Jul 2003 17:08 EEST you wrote:
> > On 24 Jul 2003 at 15:54, Kasim Oztoprak wrote:
[snip]
> We do not have a memory problem or disk problems. As I have seen on the
> list, the best way to use the disks is RAID 10 for data and RAID 1 for
> the OS. We can put in as much memory as we require.
>
> Now the question: if we have 100 searches per second, and each search
> needs 30 SQL statements, what will the performance of the system be in
> terms of response time? Let us say we have two machines as described
> above in a cluster.

That's 3000 SQL statements per second, 180 thousand per minute!!!!
What the heck is this database doing!!!!!

A quad-CPU Opteron sure is looking useful right about now... Or a quad-CPU AlphaServer ES45 running Linux, if 4x Opterons aren't available.

How complicated are each of these SELECT statements?

--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net              |
| Jefferson, LA  USA                                               |
|                                                                  |
| "I'm not a vegetarian because I love animals, I'm a vegetarian   |
|  because I hate vegetables!"                                     |
|    unknown                                                       |
+-----------------------------------------------------------------+
On 24 Jul 2003 23:25 EEST you wrote:

> That's 3000 SQL statements per second, 180 thousand per minute!!!!
> What the heck is this database doing!!!!!
>
> A quad-CPU Opteron sure is looking useful right about now... Or a
> quad-CPU AlphaServer ES45 running Linux, if 4x Opterons aren't
> available.
>
> How complicated are each of these SELECT statements?

This is a kind of directory assistance application. The select statements are actually not very complex: the database contains 25 million subscriber records, and the operators search for subscriber numbers or addresses. There are not many update operations; the update ratio is approximately 0.1%.

I will use at least 4 machines, each with 4 CPUs (2.8 GHz Xeon processors) and a suitable amount of memory.

I hope this will overcome the problem. Are there any similar implementations out there?
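At 3000 statements per second, each statement presumably has to be a cheap indexed lookup. A sketch of what one of those lookups might look like, with hypothetical table, column, and index names:

    -- A btree index makes each lookup a handful of page reads instead of
    -- a scan over 25 million rows:
    CREATE INDEX subscribers_no_idx ON subscribers (subscriber_no);

    SELECT name, address
      FROM subscribers
     WHERE subscriber_no = '3125551234';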
On 25 Jul 2003 at 16:38, Kasim Oztoprak wrote:

> This is a kind of directory assistance application. The select
> statements are actually not very complex: the database contains 25
> million subscriber records, and the operators search for subscriber
> numbers or addresses. There are not many update operations; the update
> ratio is approximately 0.1%.
>
> I will use at least 4 machines, each with 4 CPUs (2.8 GHz Xeon
> processors) and a suitable amount of memory.

Are you going to duplicate the data?

If you are going to have 3000 SQL statements per second, I would suggest:

1. Get a quad CPU. You probably need that horsepower.
2. Use prepared statements and stored procedures to avoid parsing overhead.

I doubt you would need a cluster of machines, though. If you run it through a pilot program, that will give you an idea of whether or not you need a cluster..

Bye
Shridhar

--
Default, n.: The hardware's, of course.
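For illustration, SQL-level prepared statements (available since PostgreSQL 7.3) let the server parse and plan a query once per connection and then reuse the plan. The statement, table, and column names below are hypothetical:

    -- Parsed and planned once:
    PREPARE find_subscriber (varchar) AS
        SELECT name, address FROM subscribers WHERE subscriber_no = $1;

    -- Each subsequent call skips the parse/plan overhead:
    EXECUTE find_subscriber('3125551234');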
On 24 Jul 2003 at 9:42, William Yu wrote:

> As far as I can tell, the performance impact seems to be minimal.
> There's a periodic storm of replication updates in cases where there are
> mass updates since the last resync. But if you have mostly reads and few
> writes, you shouldn't see this situation. The biggest performance impact
> seems to be the CPU power needed to zip/unzip/encrypt/decrypt files.

Can you use WAL-based replication? I don't have a URL handy, but there are replication projects which transmit WAL files to another server when they fill up.

OTOH, I was thinking of a simple replication scheme. If postgresql provided a hook where it calls an external library routine for each heap insert in WAL, there could be a simple multi-slave replication system. One wouldn't have to wait till a WAL file fills up. Of course, it's up to the library to make sure that it does not hold postgresql commits for so long that it hampers performance. There would also need to be a receiving hook which would directly heap-insert the data on the other node. But if the external library is threaded, will that work well with postgresql?

Just a thought. If it works, load-balancing could be a lot easier and near-realtime..

Bye
Shridhar

--
We fight only when there is no other choice. We prefer the ways of peaceful contact. -- Kirk, "Spectre of the Gun", stardate 4385.3
On Fri, 2003-07-25 at 11:38, Kasim Oztoprak wrote:

> This is a kind of directory assistance application. The select
> statements are actually not very complex: the database contains 25
> million subscriber records, and the operators search for subscriber
> numbers or addresses. There are not many update operations; the update
> ratio is approximately 0.1%.
>
> I will use at least 4 machines, each with 4 CPUs (2.8 GHz Xeon
> processors) and a suitable amount of memory.
>
> I hope this will overcome the problem. Are there any similar
> implementations out there?

Since PG doesn't have active-active clustering, that's out, but since the database will be very static, why not have, say, 8 machines, each with its own copy of the database? (Since there are so few updates, you feed the updates to a little Perl app that then makes the changes on each machine.) (A round-robin load balancer would do the trick in utilizing them all.)

Also, with lots of machines, you could get away with less expensive hardware, say a 2 GHz CPU, 1 GB RAM and a 40 GB IDE drive. Then, if one goes down for some reason, you've only lost a small portion of your capacity, and replacing a part will be very inexpensive. And if volume increases, just add more USD 1000 machines...

--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net              |
| Jefferson, LA  USA                                               |
|                                                                  |
| "I'm not a vegetarian because I love animals, I'm a vegetarian   |
|  because I hate vegetables!"                                     |
|    unknown                                                       |
+-----------------------------------------------------------------+
On 25 Jul 2003 17:13 EEST you wrote:

> Are you going to duplicate the data?
>
> If you are going to have 3000 SQL statements per second, I would
> suggest:
>
> 1. Get a quad CPU. You probably need that horsepower.
> 2. Use prepared statements and stored procedures to avoid parsing
>    overhead.
>
> I doubt you would need a cluster of machines, though. If you run it
> through a pilot program, that will give you an idea of whether or not
> you need a cluster..

I will try to cluster them, and I can duplicate the data if I need to. In the case of an update, I will then propagate it through to each copy.

What exactly do you mean by a pilot program?

-Kasım
On 25 Jul 2003 at 18:41, Kasim Oztoprak wrote:

> What exactly do you mean by a pilot program?

Get a quad-CPU box, load the data, and ask only 10 operators to test the system.. Beta testing, basically..

Bye
Shridhar

--
The man on top walks a lonely street; the "chain" of command is often a noose.
Folks,

> Since PG doesn't have active-active clustering, that's out, but since
> the database will be very static, why not have, say, 8 machines, each
> with its own copy of the database? (Since there are so few updates,
> you feed the updates to a little Perl app that then makes the changes
> on each machine.) (A round-robin load balancer would do the trick
> in utilizing them all.)

Another approach I've seen work is to have several servers connect to one SAN or NAS where the data lives. Only one server is enabled to handle "write" requests; all the rest are read-only. This does mean having dispatching middleware that parcels out requests among the servers, but it works very well for the Java-based company that's using it.

--
Josh Berkus
Aglio Database Solutions
San Francisco
On Fri, 2003-07-25 at 11:13, Josh Berkus wrote:
> Folks,
>
> > Since PG doesn't have active-active clustering, that's out, but since
> > the database will be very static, why not have, say, 8 machines, each
> > with its own copy of the database? (Since there are so few updates,
> > you feed the updates to a little Perl app that then makes the changes
> > on each machine.) (A round-robin load balancer would do the trick
> > in utilizing them all.)
>
> Another approach I've seen work is to have several servers connect to
> one SAN or NAS where the data lives. Only one server is enabled to
> handle "write" requests; all the rest are read-only. This does mean
> having dispatching middleware that parcels out requests among the
> servers, but it works very well for the Java-based company that's using
> it.

Wouldn't the cache on the read-only databases get out of sync with the true on-disk data?

--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net              |
| Jefferson, LA  USA                                               |
|                                                                  |
| "I'm not a vegetarian because I love animals, I'm a vegetarian   |
|  because I hate vegetables!"                                     |
|    unknown                                                       |
+-----------------------------------------------------------------+