Thread: What's the best hardware for PostgreSQL 8.1?
Hi! What do you suggest for the following problem? We have complex databases with some 100 million rows (2-3 million new records per month). Our current servers are responding poorly these days, so we have to buy new hardware for the database server. Some weeks ago we started to work with PostgreSQL 8.1, which solved the problem for some months. There are some massive, heavy queries which are too slow (5-10 minutes or more). Parallel processing is infrequent (rarely more than 4-5 queries running at once). So we need high performance in query execution with moderate parallel processing. In your opinion, which products could help us? What is the best, or at least a better, choice? The budget is about $30,000 - $40,000. Regards, Atesz
On Tue, Dec 20, 2005 at 07:27:15PM +0100, Antal Attila wrote:
> We have complex databases with some 100 million rows (2-3 million new
> records per month).

How much space does that equate to?

> Our current servers are responding poorly these days, so we have to buy
> new hardware for the database server. Some weeks ago we started to work
> with PostgreSQL 8.1, which solved the problem for some months.
> There are some massive, heavy queries which are too slow (5-10 minutes
> or more). Parallel processing is infrequent (rarely more than 4-5
> queries running at once). So we need high performance in query
> execution with moderate parallel processing.
> In your opinion, which products could help us? What is the best, or at
> least a better, choice? The budget is about $30,000 - $40,000.

Have you optimized the queries?

Items that generally have the biggest impact on performance, in decreasing order:
1. System architecture
2. Database design
3. (for long-running/problem queries) Query plans
4. Disk I/O
5. Memory
6. CPU

So, I'd make sure that the queries have been optimized (and that includes tuning postgresql.conf) before assuming you need more hardware. Based on what you've told us (very little parallelization), your biggest priority is probably either disk I/O or memory (or both). Without knowing the size of your database/working set it's difficult to provide more specific advice.
-- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
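To make Jim's last two points concrete, here is the kind of first check often suggested on this list. Everything below is illustrative only: the table names are invented, and the settings assume a dedicated 8.1 box with a few GB of RAM -- they are starting points to benchmark against, not a recommendation.

    -- How much space does the database actually occupy?  (8.1 ships these functions.)
    SELECT pg_size_pretty(pg_database_size(current_database()));

    -- Illustrative postgresql.conf starting points for a dedicated ~4GB machine.
    -- 8.1 units: shared_buffers and effective_cache_size count 8KB pages, work_mem is in KB.
    --   shared_buffers       = 50000      -- ~400MB
    --   work_mem             = 65536      -- 64MB per sort/hash, per backend
    --   effective_cache_size = 262144     -- ~2GB assumed cached by the OS
    --   random_page_cost     = 3          -- default is 4; often lowered on fast arrays

    -- Then see where a slow query really spends its time (hypothetical tables):
    EXPLAIN ANALYZE
    SELECT c.customer_id, sum(o.amount)
      FROM orders o
      JOIN customers c ON c.customer_id = o.customer_id
     WHERE o.order_date >= date '2005-11-01'
     GROUP BY c.customer_id;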
On Dec 20, 2005, at 1:27 PM, Antal Attila wrote:
> The budget is about $30,000 - $40,000.

Like Jim said, without more specifics it is hard to give more specific recommendations, but I'm architecting something like this for my current app, which needs ~100GB of disk space. I made room to grow in my configuration:

dual Opteron 2.2GHz
4GB RAM
LSI MegaRAID 320-2X
14-disk SCSI U320 enclosure with 15k RPM drives, 7 connected to each channel on the RAID:
- 1 pair in a RAID1 mirror for OS + pg_xlog
- the rest in RAID10, with each mirrored pair coming from opposite SCSI channels, for data

I run FreeBSD, but whatever you prefer should be sufficient if it is not Windows. I don't know how prices are in Hungary, but around here something like this with 36GB drives comes to around $11,000 or $12,000. The place I concentrate on is disk I/O bandwidth, which is why I prefer Opteron over Intel Xeon.
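As a sketch of how PostgreSQL 8.1 would actually consume a layout like Vivek's: pg_xlog usually ends up on the mirrored pair by moving the directory and symlinking it while the postmaster is stopped, and the data disks can be addressed with a tablespace. The mount point, tablespace name and table below are made up for illustration.

    -- Assumes the RAID10 array is mounted at /raid10/pgdata and owned by the
    -- postgres user (requires superuser; tablespaces exist since 8.0).
    CREATE TABLESPACE fastdata LOCATION '/raid10/pgdata';

    -- Place individual objects there explicitly...
    CREATE TABLE orders (
        order_id   serial PRIMARY KEY,
        amount     numeric,
        order_date date
    ) TABLESPACE fastdata;

    -- ...or make it the default location for everything created in a database:
    ALTER DATABASE mydb SET default_tablespace = 'fastdata';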
Can you elaborate on the reasons the Opteron is better than the Xeon when it comes to disk I/O? I have a PostgreSQL 7.4.8 box running a DSS. One of our tables is about 13 million rows. I had a number of queries against this table that used inner joins on 5 or 6 tables including the 13 million row one. The performance was atrocious. The database itself is about 20 gigs but I want it to scale to 100 gigs. I tuned postgresql as best I could and gave the server huge amounts of memory for caching as well. I also tweaked the cost parameters for a sequential scan vs an index scan of the query optimizer and used the query explain mechanism to get some idea of what the optimizer was doing and where I should index the tables. When I added the sixth table to the inner join the query performance took a nose dive. Admittedly this system is a single PIII 1000MHz with 1.2 gigs of ram and no raid. I do have two Ultra160 SCSI drives, with the database tables mount point on a partition of one physical drive and the pg_xlog mount point on another partition of the second drive. I have been trying to get my employer to spring for new hardware ($8k to $10k), which I had planned to be a dual dual-core Opteron system from HP. Until they agree to spend the money, I resorted to writing plpgsql functions to handle the queries. Inside plpgsql I can break the query apart into separate stages, each of which runs much faster. I can use temporary tables to store intermediate results without worrying about temp table collisions with different users thanks to transaction isolation. I am convinced we need new hardware to scale this application *but* I agree with the consensus voiced here that it is more important to optimize the query first before going out to buy new hardware. I was able to do things with PostgreSQL on this cheap server that I could never imagine doing with SQL Server or even Oracle on such a low-end box. My OS is Fedora Core 3, but I wonder if anyone has tested and benchmarked PostgreSQL on the new Sun x64 servers running Solaris 10 x86. Thanks, Juan On Tuesday 20 December 2005 16:08, Vivek Khera wrote: > On Dec 20, 2005, at 1:27 PM, Antal Attila wrote: > > The budget line is about 30 000$ - 40 000$. > > Like Jim said, without more specifics it is hard to give more > specific recommendations, but I'm architecting something like this > for my current app which needs ~100GB disk space. I made room to > grow in my configuration: > > dual opteron 2.2GHz > 4GB RAM > LSI MegaRAID 320-2X > 14-disk SCSI U320 enclosure with 15k RPM drives, 7 connected to each > channel on the RAID. > 1 pair in RAID1 mirror for OS + pg_xlog > rest in RAID10 with each mirrored pair coming from opposite SCSI > channels for data > > I run FreeBSD but whatever you prefer should be sufficient if it is > not windows. > > I don't know how prices are in Hungary, but around here something > like this with 36GB drives comes to around $11,000 or $12,000. > > The place I concentrate on is the disk I/O bandwidth which is why I > prefer Opteron over Intel XEON.
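For readers who want to picture the staging trick Juan describes, here is a minimal sketch against 8.x PL/pgSQL. Every table and column name is invented; the point is only the shape: each stage materializes a small intermediate result in a temp table instead of one six-table join, and the statements that touch the temp tables go through EXECUTE so a cached plan never points at a dropped table.

    CREATE OR REPLACE FUNCTION build_sales_report(p_from date, p_to date)
    RETURNS void AS $$
    BEGIN
        -- Stage 1: cut the big fact table down to the date range first.
        -- ON COMMIT DROP means each call (run in its own transaction) starts clean.
        CREATE TEMP TABLE stage1 (
            order_id    integer,
            customer_id integer,
            amount      numeric
        ) ON COMMIT DROP;

        EXECUTE 'INSERT INTO stage1
                     SELECT order_id, customer_id, amount
                       FROM big_fact
                      WHERE order_date BETWEEN '
                || quote_literal(p_from::text) || ' AND '
                || quote_literal(p_to::text);

        EXECUTE 'CREATE INDEX stage1_cust_idx ON stage1 (customer_id)';
        EXECUTE 'ANALYZE stage1';

        -- Stage 2: join the much smaller intermediate result to a dimension table.
        CREATE TEMP TABLE report (
            region text,
            total  numeric
        ) ON COMMIT DROP;

        EXECUTE 'INSERT INTO report
                     SELECT c.region, sum(s.amount)
                       FROM stage1 s
                       JOIN customers c ON c.customer_id = s.customer_id
                      GROUP BY c.region';
    END;
    $$ LANGUAGE plpgsql;

    -- The caller runs the stages and reads the result inside one transaction:
    BEGIN;
    SELECT build_sales_report(date '2005-01-01', date '2005-12-31');
    SELECT * FROM report;
    COMMIT;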
On Tue, 20 Dec 2005, Juan Casero wrote: > Date: Tue, 20 Dec 2005 19:50:47 -0500 > From: Juan Casero <caseroj@comcast.net> > To: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] What's the best hardver for PostgreSQL 8.1? > > Can you elaborate on the reasons the opteron is better than the Xeon when it > comes to disk io? The Opteron is cheaper, so you have more money to spend on disks :-) Also, when you go into multi-CPU systems, the front-side-bus design of the Xeons can easily become your system bottleneck, so that you can't take advantage of all the CPUs because they stall waiting for memory accesses; Opteron systems have a memory bus per socket, so the more CPUs you have, the more memory bandwidth you have. > The database itself is about 20 gigs > but I want it to scale to 100 gigs. How large is the working set? In your tests you ran into swapping on your 1.2G system; buying a dual Opteron with 16 gigs of ram will allow you to work with much larger sets of data, and you can go beyond that if needed. David Lang
Jim C. Nasby wrote: >How much space does that equate to? > > >Have you optimized the queries? > >Items that generally have the biggest impact on performance in >decreasing order: >1. System architecture >2. Database design >3. (for long-running/problem queries) Query plans >4. Disk I/O >5. Memory >6. CPU > >So, I'd make sure that the queries have been optimized (and that >includes tuning postgresql.conf) before assuming you need more hardware. > >Based on what you've told us (very little parallelization), then your >biggest priority is probably either disk IO or memory (or both). Without >knowing the size of your database/working set it's difficult to provide >more specific advice. > > Hi! We have 3 Compaq ProLiant ML530 servers with dual Xeon 2.8GHz processors, 3 GB DDR RAM, Ultra Wide SCSI RAID5 with 10000rpm drives, and Gigabit Ethernet. We partitioned our databases among these machines, but there are cross references among the machines in principle. The size of the data is now about 100-110GB. We've used these servers for 3 years with Debian Linux. We have already optimized the given queries and postgresql.conf. We tried many tricks and ideas, and we read and asked on mailing lists. There is nothing more we can do there; we have to buy a new server for the databases, because we are developing our system for newer services, so the size will keep growing. We also need better responsiveness and shorter execution times for the big queries (these queries are too complicated to discuss here, and in many cases we have already optimized them with plpgsql stored procedures). PostgreSQL 8.1 solved several parallelization and overload problems, and the average load has decreased significantly on our servers. But the big queries aren't fast enough. We think the hardware is the limit. Generally 2 queries run in parallel during working hours; we make backups at night. Regards, Atesz
Juan Casero wrote: > Can you elaborate on the reasons the opteron is better than the Xeon when it > comes to disk io? I have a PostgreSQL 7.4.8 box running a DSS. One of our Opterons have a 64-bit IOMMU -- Xeons don't. That means in 64-bit mode, for transfers to addresses above 4GB, the OS must allocate memory below 4GB, DMA to that block, and then the CPU must do extra work copying the memory above 4GB. On the Opteron, it's done by the I/O adapter using DMA in the background.
I just sent my boss an email asking him for a Sun v20z with dual 2.2 GHz Opterons, 2 gigs of RAM and RAID 1. I would have liked a better server capable of RAID but that seems to be out of his budget right now. Ok so I assume I get this Sun box. Most likely I will go with Linux since it is a fair bet he doesn't want to pay for the Solaris 10 x86 license. Although I kind of like the idea of using Solaris 10 x86 for this. I will assume I need to install the x64 kernel that comes with say Fedora Core 4. Should I run the PostgreSQL 8.x binaries in 32 bit mode or 64 bit mode? My instinct tells me 64 bit mode is most efficient for our database size, about 20 gigs right now but it may grow to 100 gigs in a year or so. I just finished loading a 20 gig database on a dual 900 MHz UltraSPARC III system with 2 gigs of ram and about 768 megs of shared memory available for the postgresql server running Solaris 10. The load has smoked a P4 3.2 GHz system I am also using with 2 gigs of ram running postgresql 8.0.3. I mean I started the sparc load after the P4 load. The sparc load has already finished rebuilding the database from a pg_dump file but the P4 system is still going. The P4 has 1.3 gigs of shared memory allocated to postgresql. How about them apples? Thanks, Juan On Wednesday 21 December 2005 18:57, William Yu wrote: > Juan Casero wrote: > > Can you elaborate on the reasons the opteron is better than the Xeon when > > it comes to disk io? I have a PostgreSQL 7.4.8 box running a DSS. One > > of our > > Opterons have 64-bit IOMMU -- Xeons don't. That means in 64-bit mode, > transfers to > 4GB, the OS must allocated the memory < 4GB, DMA to that > block and then the CPU must do extra work in copying the memory to > > 4GB. Versus on the Opteron, it's done by the IO adaptor using DMA in the > background.
AFAIK there are no licensing costs for Solaris, unless you are talking about a software support agreement, which is not required.
Hi Juan, The Solaris 10 license is free. In fact, I believe you receive the media with the Sun Fire V20z. If you want support, then there are various "pay" plans depending on the level of support. If not, your license allows you Right to Use anyway for free. That said, I haven't done much testing with 32/64-bit differences. However, for long-term purposes, 64-bit always seems to be the safe bet. As for your load performance, a lot of it depends on your file system layout also. Regards, Jignesh Juan Casero wrote: >I just sent my boss an email asking him for a Sun v20z with dual 2.2 Ghz >opterons, 2 Gigs of RAM and RAID 1. I would have liked a better server >capable of RAID but that seems to be out of his budget right now. Ok so I >assume I get this Sun box. Most likely I will go with Linux since it is a >fair bet he doesn't want to pay for the Solaris 10 x86 license. Although I >kind of like the idea of using Solaris 10 x86 for this. I will assume I >need to install the x64 kernel that comes with say Fedora Core 4. Should I >run the Postgresql 8.x binaries in 32 bit mode or 64 bit mode? My instinct >tells me 64 bit mode is most efficient for our database size about 20 gigs >right now but may grow to 100 gigs in a year or so. I just finished loading >a 20 gig database on a dual 900 Mhz Ultrasparc III system with 2 gigs of ram >and about 768 megs of shared memory available for the posgresql server >running Solaris 10. The load has smoked a P4 3.2 Ghz system I am using also >with 2 gigs of ram running postgresql 8.0.3. I mean I started the sparc >load after the P4 load. The sparc load has finished already rebuilding the >database from a pg_dump file but the P4 system is still going. The p4 has >1.3 Gigs of shared memory allocated to postgresql. How about them apples? > >Thanks, >Juan > >On Wednesday 21 December 2005 18:57, William Yu wrote: > >>Juan Casero wrote: >> >>>Can you elaborate on the reasons the opteron is better than the Xeon when >>>it comes to disk io? I have a PostgreSQL 7.4.8 box running a DSS. One >>>of our >> >>Opterons have 64-bit IOMMU -- Xeons don't. That means in 64-bit mode, >>transfers to > 4GB, the OS must allocated the memory < 4GB, DMA to that >>block and then the CPU must do extra work in copying the memory to > >>4GB. Versus on the Opteron, it's done by the IO adaptor using DMA in the >>background.
Sorry folks. I had a couple of glasses of wine as I wrote this. Anyway, I originally wanted the box to have more than two drives so I could do RAID 5, but that is going to cost too much. Also, contrary to my statement below, it seems to me I should run the 32 bit postgresql server on the 64 bit kernel. Would you agree this will probably yield the best performance? I know it depends a lot on the system, but for now this database is about 20 gigabytes. Not too large right now but it may grow 5x in the next year. Thanks, Juan On Wednesday 21 December 2005 22:09, Juan Casero wrote: > I just sent my boss an email asking him for a Sun v20z with dual 2.2 Ghz > opterons, 2 Gigs of RAM and RAID 1. I would have liked a better server > capable of RAID but that seems to be out of his budget right now. Ok so I > assume I get this Sun box. Most likely I will go with Linux since it is a > fair bet he doesn't want to pay for the Solaris 10 x86 license. Although I > kind of like the idea of using Solaris 10 x86 for this. I will assume I > need to install the x64 kernel that comes with say Fedora Core 4. Should I > run the Postgresql 8.x binaries in 32 bit mode or 64 bit mode? My > instinct tells me 64 bit mode is most efficient for our database size about > 20 gigs right now but may grow to 100 gigs in a year or so. I just > finished loading a 20 gig database on a dual 900 Mhz Ultrasparc III system > with 2 gigs of ram and about 768 megs of shared memory available for the > posgresql server running Solaris 10. The load has smoked a P4 3.2 Ghz > system I am using also with 2 gigs of ram running postgresql 8.0.3. I > mean I started the sparc load after the P4 load. The sparc load has > finished already rebuilding the database from a pg_dump file but the P4 > system is still going. The p4 has 1.3 Gigs of shared memory allocated to > postgresql. How about them apples? > > Thanks, > Juan > > On Wednesday 21 December 2005 18:57, William Yu wrote: > > Juan Casero wrote: > > > Can you elaborate on the reasons the opteron is better than the Xeon > > > when it comes to disk io? I have a PostgreSQL 7.4.8 box running a > > > DSS. One of our > > > > Opterons have 64-bit IOMMU -- Xeons don't. That means in 64-bit mode, > > transfers to > 4GB, the OS must allocated the memory < 4GB, DMA to that > > block and then the CPU must do extra work in copying the memory to > > > 4GB. Versus on the Opteron, it's done by the IO adaptor using DMA in the > > background.
On Wed, 21 Dec 2005, Juan Casero wrote: > Date: Wed, 21 Dec 2005 22:31:54 -0500 > From: Juan Casero <caseroj@comcast.net> > To: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] What's the best hardver for PostgreSQL 8.1? > > Sorry folks. I had a couple of glasses of wine as I wrote this. Anyway I > originally wanted the box to have more than two drives so I could do RAID 5 > but that is going to cost too much. Also, contrary to my statement below it > seems to me I should run the 32 bit postgresql server on the 64 bit kernel. > Would you agree this will probably yield the best performance? You definitely need a 64 bit kernel to address as much ram as you will need. The question of 32 bit vs 64 bit postgres needs to be benchmarked, but my inclination is that you probably do want 64 bit for that as well. 64 bit binaries are slightly larger than 32 bit ones (less so on x86/AMD64 than on any other mixed platform though), but the 64 bit version also has access to twice as many registers as a 32 bit one, and the Opteron chips have some other features that become available (or more useful) in 64 bit mode. Like everything else this needs benchmarks to prove with your workload (I'm trying to get some started, but haven't had a chance yet). David Lang > I know it > depends alot on the system but for now this database is about 20 gigabytes. > Not too large right now but it may grow 5x in the next year. > > Thanks, > Juan > > On Wednesday 21 December 2005 22:09, Juan Casero wrote: >> I just sent my boss an email asking him for a Sun v20z with dual 2.2 Ghz >> opterons, 2 Gigs of RAM and RAID 1. I would have liked a better server >> capable of RAID but that seems to be out of his budget right now. Ok so I >> assume I get this Sun box. Most likely I will go with Linux since it is a >> fair bet he doesn't want to pay for the Solaris 10 x86 license. Although I >> kind of like the idea of using Solaris 10 x86 for this. I will assume I >> need to install the x64 kernel that comes with say Fedora Core 4. Should I >> run the Postgresql 8.x binaries in 32 bit mode or 64 bit mode? My >> instinct tells me 64 bit mode is most efficient for our database size about >> 20 gigs right now but may grow to 100 gigs in a year or so. I just >> finished loading a 20 gig database on a dual 900 Mhz Ultrasparc III system >> with 2 gigs of ram and about 768 megs of shared memory available for the >> posgresql server running Solaris 10. The load has smoked a P4 3.2 Ghz >> system I am using also with 2 gigs of ram running postgresql 8.0.3. I >> mean I started the sparc load after the P4 load. The sparc load has >> finished already rebuilding the database from a pg_dump file but the P4 >> system is still going. The p4 has 1.3 Gigs of shared memory allocated to >> postgresql. How about them apples? >> >> Thanks, >> Juan >> >> On Wednesday 21 December 2005 18:57, William Yu wrote: >>> Juan Casero wrote: >>>> Can you elaborate on the reasons the opteron is better than the Xeon >>>> when it comes to disk io? I have a PostgreSQL 7.4.8 box running a >>>> DSS. One of our >>> >>> Opterons have 64-bit IOMMU -- Xeons don't. That means in 64-bit mode, >>> transfers to > 4GB, the OS must allocated the memory < 4GB, DMA to that >>> block and then the CPU must do extra work in copying the memory to > >>> 4GB. Versus on the Opteron, it's done by the IO adaptor using DMA in the >>> background.
Ok thanks. I think I will go with 64 bit everything on the box. If I can get the Sun Fire V20Z then I will stick with Solaris 10 x86 and download the 64 bit PostgreSQL 8.1 binaries from blastwave.org. I develop the PHP code for my DSS system on my Windows XP laptop. Normally, I test the code on this laptop but let it hit the live database when I want to run some tests. Well, just this afternoon I installed PostgreSQL 8.1.1 on my Windows laptop and rebuilt the entire live database instance on there from a pg_dump archive. I am blown away by the performance increase in PostgreSQL 8.1.x. Has anyone else had a chance to test it? All the queries I run against it are remarkably fast, but more importantly I can see that both logical processors of my Hyper-Threaded P4 are being used. One of the questions I posted on this list was whether PostgreSQL could make use of the large number of cores available on the UltraSPARC T1000/T2000. I am beginning to think that with PostgreSQL 8.1.x the buffer manager could indeed use all those cores. This could make running a DSS or OLTP system on an UltraSPARC T1000/T2000 with PostgreSQL a much better bargain than on an Intel system. Any thoughts? Thanks, Juan On Thursday 22 December 2005 22:12, David Lang wrote: > On Wed, 21 Dec 2005, Juan Casero wrote: > > Date: Wed, 21 Dec 2005 22:31:54 -0500 > > From: Juan Casero <caseroj@comcast.net> > > To: pgsql-performance@postgresql.org > > Subject: Re: [PERFORM] What's the best hardver for PostgreSQL 8.1? > > > > Sorry folks. I had a couple of glasses of wine as I wrote this. Anyway > > I originally wanted the box to have more than two drives so I could do > > RAID 5 but that is going to cost too much. Also, contrary to my > > statement below it seems to me I should run the 32 bit postgresql server > > on the 64 bit kernel. Would you agree this will probably yield the best > > performance? > > you definantly need a 64 bit kernel to address as much ram as you will > need. > > the question of 32 bit vs 64 bit postgres needs to be benchmarked, but my > inclination is that you probably do want 64 bit for that as well. > > 64 bit binaries are slightly larger then 32 bit ones (less so on x86/AMD64 > then on any other mixed platform though), but the 64 bit version also has > access to twice as many registers as a 32 bit one, and the Opteron chips > have some other features that become availabel in 64 bit mode (or more > useful) > > like everything else this needs benchmarks to prove with your workload > (I'm trying to get some started, but haven't had a chance yet) > > David Lang > > > I know it > > depends alot on the system but for now this database is about 20 > > gigabytes. Not too large right now but it may grow 5x in the next year. > > > > Thanks, > > Juan > > > > On Wednesday 21 December 2005 22:09, Juan Casero wrote: > >> I just sent my boss an email asking him for a Sun v20z with dual 2.2 Ghz > >> opterons, 2 Gigs of RAM and RAID 1. I would have liked a better server > >> capable of RAID but that seems to be out of his budget right now. Ok so > >> I assume I get this Sun box. Most likely I will go with Linux since it > >> is a fair bet he doesn't want to pay for the Solaris 10 x86 license. > >> Although I kind of like the idea of using Solaris 10 x86 for this. I > >> will assume I need to install the x64 kernel that comes with say Fedora > >> Core 4. Should I run the Postgresql 8.x binaries in 32 bit mode or 64 > >> bit mode?
On Thu, 22 Dec 2005, Juan Casero wrote: > Ok thanks. I think I will go with 64 bit everything on the box. If I can get > the Sun Fire V20Z then I will stick with Solaris 10 x86 and download the 64 > bit PostgreSQL 8.1 binaries from blastwave.org. I develop the PHP code to > my DSS system on my Windows XP laptop. Normally, I test the code on this > laptop but let it hit the live database when I want to run some tests. Well > just this afternoon I installed PostgreSQL 8.1.1 on my windows laptop and > rebuilt the the entire live database instance on there from a pg_dump > archive. I am blown away by the performance increase in PostgreSQL 8.1.x. > Has anyone else had a chance to test it? All the queries I run against it > are remarkably fast but more importantly I can see that the two cores of my > Hyper Threaded P4 are being used. One of the questions I posted on this > list was whether PostgreSQL could make use of the large number of cores > available on the Ultrasparc T1000/T2000 cores. I am beginning to think that > with PostgreSQL 8.1.x the buffer manager could indeed use all those cores. > This could make running a DSS or OLTP on an Ultrasparc T1000/T2000 with > PostgreSQL a much better bargain than on an intel system. Any thoughts? If you have enough simultaneous transactions, and your I/O systems (disk and memory interfaces) can keep up with your needs, then postgres can use quite a few cores. There are some limits that will show up with more cores, but I don't think it's well known where they are (this will also be very dependent on your workload). There was a discussion within the last month or two that hit the postgres weekly news where more attention is being paid to the locking mechanisms used, so this is an area under active development (note especially that some locking strategies that work well with multiple full cores can be crippling with virtual cores (Intel HT etc)). But it boils down to the fact that there just isn't enough experience with the new Sun systems to know how well they will work. They could end up being fabulous speed demons, or dogs (and it could even be both, depending on your workload). David Lang > Thanks, > Juan > > On Thursday 22 December 2005 22:12, David Lang wrote: >> On Wed, 21 Dec 2005, Juan Casero wrote: >>> Date: Wed, 21 Dec 2005 22:31:54 -0500 >>> From: Juan Casero <caseroj@comcast.net> >>> To: pgsql-performance@postgresql.org >>> Subject: Re: [PERFORM] What's the best hardver for PostgreSQL 8.1? >>> >>> Sorry folks. I had a couple of glasses of wine as I wrote this. Anyway >>> I originally wanted the box to have more than two drives so I could do >>> RAID 5 but that is going to cost too much. Also, contrary to my >>> statement below it seems to me I should run the 32 bit postgresql server >>> on the 64 bit kernel. Would you agree this will probably yield the best >>> performance? >> >> you definantly need a 64 bit kernel to address as much ram as you will >> need. >> >> the question of 32 bit vs 64 bit postgres needs to be benchmarked, but my >> inclination is that you probably do want 64 bit for that as well.
On Dec 22, 2005, at 11:14 PM, David Lang wrote: > but it boils down to the fact that there just isn't enough > experience with the new Sun systems to know how well they will > work. they could end up being fabulous speed demons, or dogs (and > it could even be both, depending on your workload) The v20z isn't the newest Sun hardware anyhow... The X2100, X4100, and X4200 are. I've been trying to buy an X4100 for going on three weeks now, but the local Sun reseller is making it very hard. You'd think they'd actually want to go out of their way to make a sale, but they seem to do the opposite. For those of you who say 'well, it is a small sale': my original request was for over $50k in equipment, and after a while I decided that other equipment from other vendors who do care was sufficient, and only the Opteron boxes needed to come from Sun. Add a zero return policy and you wonder how they expect to keep in business.... Sorry, I had to vent. But once it does come in I'll be glad to post up some numbers :-)
Vivek Khera wrote: > and only the > opteron boxes needed to come from sun. add a zero return policy and you > wonder how they expect to keep in business.... > > sorry, i had to vent. > Just out of interest - why did the opterons need to come from Sun?
On Dec 23, 2005, at 5:15 PM, Mark Kirkwood wrote: > Vivek Khera wrote: > >> and only the opteron boxes needed to come from sun. add a zero >> return policy and you wonder how they expect to keep in business.... >> sorry, i had to vent. >> > > Just out of interest - why did the opterons need to come from Sun? There are three tier-1 vendors selling opteron: IBM, Sun, and HP. HP's have historically had slow RAID configurations, and IBM tries to hide them and only offers really one model, a 1U unit. I've already been through buying opteron systems from the smaller vendors and it basically wasted a lot of my time due to what seems like quality control issues. So it could be Sun or IBM. IBM seems to make it harder to buy from them than Sun...
On Wed, 21 Dec 2005 22:31:54 -0500 Juan Casero <caseroj@comcast.net> wrote: > Sorry folks. I had a couple of glasses of wine as I wrote this. > Anyway I originally wanted the box to have more than two drives so I > could do RAID 5 but that is going to cost too much. Also, contrary > to my statement below it seems to me I should run the 32 bit > postgresql server on the 64 bit kernel. Would you agree this will > probably yield the best performance? I know it depends alot on the > system but for now this database is about 20 gigabytes. Not too large > right now but it may grow 5x in the next year. You definitely DO NOT want to do RAID 5 on a database server. That is probably the worst setup you could have, I've seen it have lower performance than just a single hard disk. RAID 1 and RAID 1+0 are optimal, but you want to stay far away from RAID 5. IMHO RAID 5 is only useful on near line backup servers or Samba file servers where space is more important than speed. --------------------------------- Frank Wiles <frank@wiles.org> http://www.wiles.org ---------------------------------
At 02:50 PM 12/24/2005, Frank Wiles wrote: >On Wed, 21 Dec 2005 22:31:54 -0500 >Juan Casero <caseroj@comcast.net> wrote: > > > Sorry folks. I had a couple of glasses of wine as I wrote this. > > Anyway I originally wanted the box to have more than two drives so I > > could do RAID 5 but that is going to cost too much. Also, contrary > > to my statement below it seems to me I should run the 32 bit > > postgresql server on the 64 bit kernel. Would you agree this will > > probably yield the best performance? I know it depends alot on the > > system but for now this database is about 20 gigabytes. Not too large > > right now but it may grow 5x in the next year. > > You definitely DO NOT want to do RAID 5 on a database server. That > is probably the worst setup you could have, I've seen it have lower > performance than just a single hard disk. > > RAID 1 and RAID 1+0 are optimal, but you want to stay far away from > RAID 5. IMHO RAID 5 is only useful on near line backup servers or > Samba file servers where space is more important than speed. That's a bit misleading. RAID 5 excels when you want read speed but don't care as much about write speed. Writes are typically ~2/3 the speed of reads on a decent RAID 5 setup. Side Note: Some years ago Mylex had a family of fast (for the time) RAID 5 HW controllers that actually read and wrote at the same speed. IBM bought them to kill them and protect LSI Logic. Mylex X24's (?IIRC the model number correctly?) are still reasonable HW. So if you have tables that are read often and written to rarely or not at all, putting them on RAID 5 is optimal. In both data-mining-like and OLTP-like apps there are usually at least some such tables. RAID 1 is good for stuff where speed doesn't matter and all you are looking for is an insurance policy. RAID 10 is the best way to get high performance on both reads and writes, but it has a significantly greater cost for the same amount of usable physical media. If you've got the budget or are dealing with small enough physical storage needs, by all means use RAID 10. OTOH, if you are dealing with large enterprise class apps like Sarbanes Oxley compliance, medical and/or insurance, etc, etc, the storage needs can get so large that RAID 10 for everything or even most things is not possible, even if economically feasible. RAID levels are like any other tool. Each is useful in the proper circumstances. Happy holidays, Ron Peacetree
> > If you've got the budget or are dealing with small enough physical > storage needs, by all means use RAID 10. OTOH, if you are dealing > with large enterprise class apps like Sarbanes Oxley compliance, > medical and/or insurance, etc, etc, the storage needs can get so large > that RAID 10 for everything or even most things is not possible. Even > if economically feasible. > > RAID levels are like any other tool. Each is useful in the proper > circumstances. > There is also RAID 50 which is quite nice. Joshua D. Drake > Happy holidays, > Ron Peacetree -- The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564 PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: PLphp, PLperl - http://www.commandprompt.com/
On Sat, 24 Dec 2005, Ron wrote: > At 02:50 PM 12/24/2005, Frank Wiles wrote: >> Juan Casero <caseroj@comcast.net> wrote: >> >> > Sorry folks. I had a couple of glasses of wine as I wrote this. >> > Anyway I originally wanted the box to have more than two drives so I >> > could do RAID 5 but that is going to cost too much. Also, contrary >> > to my statement below it seems to me I should run the 32 bit >> > postgresql server on the 64 bit kernel. Would you agree this will >> > probably yield the best performance? I know it depends alot on the >> > system but for now this database is about 20 gigabytes. Not too large >> > right now but it may grow 5x in the next year. >> >> You definitely DO NOT want to do RAID 5 on a database server. That >> is probably the worst setup you could have, I've seen it have lower >> performance than just a single hard disk. >> >> RAID 1 and RAID 1+0 are optimal, but you want to stay far away from >> RAID 5. IMHO RAID 5 is only useful on near line backup servers or >> Samba file servers where space is more important than speed. > That's a bit misleading. RAID 5 excels when you want read speed but don't > care as much about write speed. Writes are typical ~2/3 the speed of reads > on a typical decent RAID 5 set up. > > So if you have tables that are read often and written to rarely or not at > all, putting them on RAID 5 is optimal. In both data mining like and OLTP > like apps there are usually at least some such tables. raid 5 is bad for random writes as you state, but how does it do for sequential writes (for example data mining where you do a large import at one time, but seldom do other updates). I'm assuming a controller with a reasonable amount of battery-backed cache. David Lang
At 04:42 PM 12/24/2005, Joshua D. Drake wrote: >>If you've got the budget or are dealing with small enough physical >>storage needs, by all means use RAID 10. OTOH, if you are dealing >>with large enterprise class apps like Sarbanes Oxley compliance, >>medical and/or insurance, etc, etc, the storage needs can get so >>large that RAID 10 for everything or even most things is not >>possible. Even if economically feasible. >> >>RAID levels are like any other tool. Each is useful in the proper >>circumstances. >There is also RAID 50 which is quite nice. The "quite nice" part that Joshua is referring to is that RAID 50 gets most of the write performance of RAID 10 w/o using nearly as many HD's as RAID 10. OTOH, there still is a significant increase in the number of HD's used, which means failures become more frequent (the array's MTBF drops), yet you are not getting the protection levels you would with RAID 10. IME RAID 50 gets mixed reviews. My two biggest issues are a= Admin of RAID 50 is more complex than the other commonly used versions (1, 10, 5, and 6) b= Once a HD failure takes place, you suffer a _permanent_ performance drop, even after the automatic volume rebuild, until you take the entire RAID 50 array offline, reinitialize it, and rebuild it from scratch. IME "a" and "b" make RAID 50 inappropriate for any but the biggest and most dedicated of DB admin groups. YMMV, Ron
David Lang wrote: > raid 5 is bad for random writes as you state, but how does it do for > sequential writes (for example data mining where you do a large import > at one time, but seldom do other updates). I'm assuming a controller > with a reasonable amount of battery-backed cache.

Random write performance (a small block that only writes to 1 drive): 1 write requires N-1 reads + N writes --> efficiency ~ 1/(2N-1).

Sequential write performance (a block big enough to use all N drives): N-1 writes require N writes --> efficiency ~ (N-1)/N.

This assumes enough cache so all reads/writes are done in 1 transaction, plus an onboard processor that calculates RAID parity fast enough not to cause an extra delay.
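Plugging a hypothetical drive count into those two expressions makes the gap concrete (N = 8 is assumed purely for illustration):

    % worked example of the efficiency model above, with N = 8 drives
    \frac{1}{2N-1} = \frac{1}{15} \approx 6.7\%   \quad\text{(small random writes)}
    \frac{N-1}{N} = \frac{7}{8} = 87.5\%          \quad\text{(full-stripe sequential writes)}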
At 04:54 PM 12/24/2005, David Lang wrote: >raid 5 is bad for random writes as you state, but how does it do for >sequential writes (for example data mining where you do a large >import at one time, but seldom do other updates). I'm assuming a >controller with a reasonable amount of battery-backed cache. The issue with RAID 5 writes centers on the need to recalculate checksums for the ECC blocks distributed across the array and then write the new ones to physical media. Caches help, and the bigger the cache the better, but once you are doing enough writes fast enough (and that doesn't take much even with a few GBs of cache) the recalculate-checksums-and-write-new-ones overhead will decrease the write speed of real data. Bear in mind that the HD's _raw_ write speed hasn't been decreased. Those HD's are pounding away as fast as they can for you. Your _effective_ or _data level_ write speed is what decreases due to overhead. Side Note: people often forget the other big reason to use RAID 10 over RAID 5. RAID 5 is always only 2 HD failures from data loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss unless you get unlucky and lose both members of a RAID 1 set. This can be seen as an example of the classic space vs. time trade off in performance tuning. You can use 2x the HDs you need and implement RAID 10 for best performance and reliability or you can dedicate less HD's to RAID and implement RAID 5 for less (write) performance and lower reliability. TANSTAAFL. Ron Peacetree
Hi, > b= Once a HD failure takes place, you suffer a _permenent_ performance > drop, even after the automatic volume rebuild, until you take the entire > RAID 50 array off line, reinitialize it, and rebuild it from scratch. Where did you get that crazy idea? When you have replaced the drive and the RAID is rebuilt, you have exactly the same situation as before the drive failed. Why would you get less performance? Sander.
Frank, > You definitely DO NOT want to do RAID 5 on a database server. That > is probably the worst setup you could have, I've seen it have lower > performance than just a single hard disk. I've seen that on RAID0 and RAID10 as well. This is more about the quality and modernity of the RAID controller than anything else at this point, although there are some theoretical advantages of RAID10 from a random seek standpoint even if the adapter CPU is infinitely fast at checksumming. We're using RAID5 in practice for OLAP / Data Warehousing systems very successfully using the newest RAID cards from 3Ware (9550SX). Note that host-based SCSI RAID cards from LSI, Adaptec, Intel, Dell, HP and others have proven to have worse performance than a single disk drive in many cases, whether for RAID0 or RAID5. In most circumstances I've seen, people don't even notice until they write a message to a mailing list about "my query runs slowly on xxx dbms". In many cases, after they run a simple sequential transfer rate test using dd, they see that their RAID controller is the culprit. Recently, I helped a company named DeepData to improve their dbms performance, which was a combination of moving them to software RAID50 on Linux and getting them onto Bizgres. The disk subsystem sped up on the same hardware (minus the HW RAID card) by over a factor of 10. The downside is that SW RAID is a pain in the neck for management - you have to shut down the Linux host to replace a failed disk. - Luke
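The quick dd check Luke mentions usually looks something like the following (illustrative only: the path, size and block size are arbitrary, and the test file should be several times larger than RAM so the OS cache doesn't flatter the result):

    # Sequential write of ~8GB, then read it back; GNU dd reports the MB/s rate when it finishes.
    dd if=/dev/zero of=/data/ddtest bs=8k count=1000000
    dd if=/data/ddtest of=/dev/null bs=8k
    rm /data/ddtest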
On Sat, 24 Dec 2005, Luke Lonergan wrote: > Recently, I helped a company named DeepData to improve their dbms > performance, which was a combination of moving them to software RAID50 > on Linux and getting them onto Bizgres. The disk subsystem sped up on > the same hardware (minus the HW RAID card) by over a factor of 10. The > downside is that SW RAID is a pain in the neck for management - you have > to shut down the Linux host to replace a failed disk. Luke, you should not need to shut down the Linux host when a disk fails. You should be able to use mdadm to mark the drive as failed, then remove it from the system and replace it, then use mdadm to add the drive to the array. I'm fighting through a double disk failure on my system at home, and when I hit a bad spot on a drive (failing it from the array) I can just re-add it without having to restart everything (if it's the second drive I will have to stop and restart the array, but that's because the entire array has failed at that point). Now hot-swap may not be supported on all interface types, and that may be what you have run into, but with SCSI or SATA you should be able to hot-swap with the right controller. David Lang
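For reference, the mdadm sequence David describes looks roughly like this (a sketch only: /dev/md0 and /dev/sdc1 are placeholder device names, and the physical swap still needs a controller and backplane that support hot-plug):

    # Mark the failing disk as failed and pull it out of the array.
    mdadm --manage /dev/md0 --fail /dev/sdc1
    mdadm --manage /dev/md0 --remove /dev/sdc1
    # ...physically replace the drive and partition it to match, then re-add it:
    mdadm --manage /dev/md0 --add /dev/sdc1
    # Watch the rebuild progress:
    cat /proc/mdstat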
David, > now hot-swap may not be supported on all interface types, that may be what > you have run into, but with SCSI or SATA you should be able to hot-swap > with the right controller. That's actually the problem - Linux hot swap is virtually non-functional for SCSI. You can write into the proper places in /proc, then remove and rescan to get a new drive up, but I've found that the resulting OS state is flaky. This is true of the latest 2.6 kernels and LSI and Adaptec SCSI controllers. The problems I've seen are with Linux, not the controllers. - Luke
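The "proper places in /proc" are the old SCSI mid-layer hooks; the usual incantation is roughly the following, where the host/channel/id/lun numbers are placeholders for the slot in question. As Luke says, the state the kernel is left in can be flaky, so treat this as a last resort rather than a routine procedure:

    # Detach the device at host 0, channel 0, id 3, lun 0, swap the drive, then rescan that slot.
    echo "scsi remove-single-device 0 0 3 0" > /proc/scsi/scsi
    # ...replace the physical disk...
    echo "scsi add-single-device 0 0 3 0" > /proc/scsi/scsi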
Luke Lonergan wrote: > Note that host-based SCSI raid cards from LSI, Adaptec, Intel, Dell, HP > and others have proven to have worse performance than a single disk > drive in many cases, whether for RAID0 or RAID5. In most circumstances This is my own experience. Running an LSI MegaRAID in pure passthrough mode + Linux software RAID10 is a ton faster than configuring the RAID via the LSI card. One of the things I've noticed is that the card does not seem to be able to do parallel reads on mirrors. While looking at iostat under Linux, I can see software RAID1 reading all drives and the MD number adding up to the sum of all drives. The ARECA SATA controller I just got, though, doesn't seem to exhibit these problems. Performance is a few % points above Linux software RAID at lower CPU usage. In fact, I'm getting better single-threaded bandwidth on a 4x7200RPM SATA config versus a 6x15K SCSI config on the LSI. The drives are bigger for the SATA config (300GB versus 36GB for the SCSI), so the heads don't have to move anywhere near as much and can stay on the fast portion of the disk. Haven't had a chance to test a multi-user DB between the two setups though.
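William's observation is easy to reproduce with iostat from the sysstat package; on a software RAID1 you would expect both member disks to be busy during reads, with the md device's throughput roughly the sum of the two (device names below are placeholders):

    # Extended per-device statistics, refreshed every 5 seconds; compare the read-throughput
    # columns of sda, sdb and md0 while a large sequential read is running.
    iostat -x 5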
Luke Lonergan wrote: >David, > >>now hot-swap may not be supported on all interface types, that may be what >>you have run into, but with SCSI or SATA you should be able to hot-swap >>with the right controller. > >That's actually the problem - Linux hot swap is virtually non-functional for SCSI. You can write into the proper places in /proc, then remove and rescan to get a new drive up, but I've found that the resulting OS state is flaky. This is true of the latest 2.6 kernels and LSI and Adaptec SCSI controllers. > >The problems I've seen are with Linux, not the controllers. > Interesting, I have had zero problems with Linux and SATA with LSI controllers and hot plug. I wonder what the difference is. Even though it is SATA, the LSI controller just uses the SCSI driver. Joshua D. Drake >- Luke -- The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564 PostgreSQL Replication, Consulting, Custom Development, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: PLphp, PLperl - http://www.commandprompt.com/
On Sat, 24 Dec 2005, Luke Lonergan wrote: > David, > >> now hot-swap may not be supported on all interface types, that may be what >> you have run into, but with SCSI or SATA you should be able to hot-swap >> with the right controller. > > That's actually the problem - Linux hot swap is virtually non-functional for SCSI. You can write into the proper places in /proc, then remove and rescan to get a new drive up, but I've found that the resulting OS state is flaky. This is true of the latest 2.6 kernels and LSI and Adaptec SCSI controllers. > > The problems I've seen are with Linux, not the controllers. Thanks for the clarification. I knew that PATA didn't do hotswap, and I've seen discussions on the linux-kernel list about SATA hotswap being worked on, but I thought that SCSI handled it. How recent a kernel have you had problems with? David Lang
On Sun, Dec 25, 2005 at 04:13:57AM -0800, David Lang wrote: > Thanks for the clarification, I knew that PATA didn't do hotswap, and I've > seen discussions on the linux-kernel list about SATA hotswap being worked > on, but I thought that scsi handled it. how recent a kernel have you had > problems with? It has largely worked for us, even though it's a bit hackish -- you _must_ disconnect the drive properly in the kernel before ejecting it physically, though, or it will never reconnect. At least that's how it is with our Adaptec 19160. /* Steinar */ -- Homepage: http://www.sesse.net/
On Sat, Dec 24, 2005 at 22:13:43 -0500, Luke Lonergan <LLonergan@greenplum.com> wrote: > David, > > > now hot-swap may not be supported on all interface types, that may be what > > you have run into, but with SCSI or SATA you should be able to hot-swap > > with the right controller. > > That's actually the problem - Linux hot swap is virtually non-functional for SCSI. You can write into the proper places in /proc, then remove and rescan to get a new drive up, but I've found that the resulting OS state is flaky. This is true of the latest 2.6 kernels and LSI and Adaptec SCSI controllers. > > The problems I've seen are with Linux, not the controllers. The other option is to keep hot spares available so that you can have a failure or two before you have to pull drives out. This might allow you to get to a maintenance window to swap out the bad drives.
On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: >Caches help, and the bigger the cache the better, but once you are >doing enough writes fast enough (and that doesn't take much even with >a few GBs of cache) the recalculate-checksums-and-write-new-ones >overhead will decrease the write speed of real data. Bear in mind >that the HD's _raw_ write speed hasn't been decreased. Those HD's >are pounding away as fast as they can for you. Your _effective_ or >_data level_ write speed is what decreases due to overhead. You're overgeneralizing. Assuming a large cache and a sequential write, there need be no penalty for RAID 5. (For random writes you may need to read unrelated blocks in order to calculate parity, but for large sequential writes the parity blocks should all be read from cache.) A modern CPU can calculate parity for RAID 5 on the order of gigabytes per second, and even crummy embedded processors can do hundreds of megabytes per second. You may have run into some lousy implementations, but you should be much more specific about what hardware you're talking about instead of making sweeping generalizations. >Side Note: people often forget the other big reason to use RAID 10 >over RAID 5. RAID 5 is always only 2 HD failures from data >loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss >unless you get unlucky and lose both members of a RAID 1 set. IOW, your RAID 10 is only 2 HD failures from data loss also. If that's an issue you need to go with RAID 6 or add another disk to each mirror. Mike Stone
Have you done any benchmarking of the 9550SX against a software raid configuration?
Luke Lonergan wrote:
Frank, > You definitely DO NOT want to do RAID 5 on a database server. That is probably the worst setup you could have, I've seen it have lower performance than just a single hard disk. I've seen that on RAID0 and RAID10 as well. This is more about the quality and modernity of the RAID controller than anything else at this point, although there are some theoretical advantages of RAID10 from a random seek standpoint even if the adapter CPU is infinitely fast at checksumming. We're using RAID5 in practice for OLAP / Data Warehousing systems very successfully using the newest RAID cards from 3Ware (9550SX). Note that host-based SCSI raid cards from LSI, Adaptec, Intel, Dell, HP and others have proven to have worse performance than a single disk drive in many cases, whether for RAID0 or RAID5. In most circumstances I've seen, people don't even notice until they write a message to a mailing list about "my query runs slowly on xxx dbms". In many cases, after they run a simple sequential transfer rate test using dd, they see that their RAID controller is the culprit. Recently, I helped a company named DeepData to improve their dbms performance, which was a combination of moving them to software RAID50 on Linux and getting them onto Bizgres. The disk subsystem sped up on the same hardware (minus the HW RAID card) by over a factor of 10. The downside is that SW RAID is a pain in the neck for management - you have to shut down the Linux host when a disk fails to replace it. - Luke ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend
Benjamin, > Have you done any benchmarking of the 9550SX against a software raid configuration? Interesting - no, not on SATA, mostly because I've had awful luck with Linux drivers and SATA. The popular manufacturers of SATA to PCI bridge chipsets are Silicon Image and Highpoint, and I've not seen Linux work with them at any reasonable performance yet. I've also had problems with Adaptec's cards - I think they manufacture their own SATA to PCI chipset as well. So far, I've only had good luck with the on-chipset Intel SATA implementation. I think the problems I've had could be entirely driver-related, but in the end it doesn't matter if you can't find drivers that work for Linux. The other problem is getting enough SATA connections for the number of disks we want. I do have two new Areca SATA RAID cards and I'm going to benchmark those against the 3Ware 9550SX with 2 x 8 = 16 disks on one host. I guess we could run the HW RAID controllers in JBOD mode to get a good driver / chipset configuration for software RAID, but frankly I prefer HW RAID if it performs well. So far the SATA host-based RAID is blowing the doors off of every other HW RAID solution I've tested. - Luke
It's irrelevant what controller you use; you still have to actually write the parity blocks, which slows down your write speed because you have to write n+n/2 blocks instead of just n blocks, making the system write 50% more data. RAID 5 must write 50% more data to disk, therefore it will always be slower. Alex. On 12/25/05, Michael Stone <mstone+postgres@mathom.us> wrote: > On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: > >Caches help, and the bigger the cache the better, but once you are > >doing enough writes fast enough (and that doesn't take much even with > >a few GBs of cache) the recalculate-checksums-and-write-new-ones > >overhead will decrease the write speed of real data. Bear in mind > >that the HD's _raw_ write speed hasn't been decreased. Those HD's > >are pounding away as fast as they can for you. Your _effective_ or > >_data level_ write speed is what decreases due to overhead. > > You're overgeneralizing. Assuming a large cache and a sequential write, > there's need be no penalty for raid 5. (For random writes you may > need to read unrelated blocks in order to calculate parity, but for > large sequential writes the parity blocks should all be read from > cache.) A modern cpu can calculate parity for raid 5 on the order of > gigabytes per second, and even crummy embedded processors can do > hundreds of megabytes per second. You may have run into some lousy > implementations, but you should be much more specific about what > hardware you're talking about instead of making sweeping > generalizations. > > >Side Note: people often forget the other big reason to use RAID 10 > >over RAID 5. RAID 5 is always only 2 HD failures from data > >loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss > >unless you get unlucky and lose both members of a RAID 1 set. > > IOW, your RAID 10 is only 2 HD failures from data loss also. If that's > an issue you need to go with RAID 6 or add another disk to each mirror. > > Mike Stone > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend >
On Mon, 26 Dec 2005, Alex Turner wrote: > It's irrelavent what controller, you still have to actualy write the > parity blocks, which slows down your write speed because you have to > write n+n/2 blocks. instead of just n blocks making the system write > 50% more data. > > RAID 5 must write 50% more data to disk therefore it will always be slower. raid5 writes n+1 blocks, not n+n/2 (unless n=2 for a 3-disk raid); you can have a 15+1 disk raid5 array, for example. However, raid1 (and raid10) have to write 2*n blocks to disk, so if you are talking about pure I/O needed, raid5 wins hands down (the same 16 drives would be an 8+8 array). What slows down raid 5 is that to modify a block you have to read blocks from all your drives to re-calculate the parity. This interleaving of reads and writes when all you are logically doing is writes can really hurt. (This is why I asked the question that got us off on this tangent: when doing new writes to an array you don't have to read the blocks as they are blank, assuming your caching is enough so that you can write blocksize*n before the system starts actually writing the data.) David Lang > Alex. > > On 12/25/05, Michael Stone <mstone+postgres@mathom.us> wrote: >> On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: >>> Caches help, and the bigger the cache the better, but once you are >>> doing enough writes fast enough (and that doesn't take much even with >>> a few GBs of cache) the recalculate-checksums-and-write-new-ones >>> overhead will decrease the write speed of real data. Bear in mind >>> that the HD's _raw_ write speed hasn't been decreased. Those HD's >>> are pounding away as fast as they can for you. Your _effective_ or >>> _data level_ write speed is what decreases due to overhead. >> >> You're overgeneralizing. Assuming a large cache and a sequential write, >> there's need be no penalty for raid 5. (For random writes you may >> need to read unrelated blocks in order to calculate parity, but for >> large sequential writes the parity blocks should all be read from >> cache.) A modern cpu can calculate parity for raid 5 on the order of >> gigabytes per second, and even crummy embedded processors can do >> hundreds of megabytes per second. You may have run into some lousy >> implementations, but you should be much more specific about what >> hardware you're talking about instead of making sweeping >> generalizations. >> >>> Side Note: people often forget the other big reason to use RAID 10 >>> over RAID 5. RAID 5 is always only 2 HD failures from data >>> loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss >>> unless you get unlucky and lose both members of a RAID 1 set. >> >> IOW, your RAID 10 is only 2 HD failures from data loss also. If that's >> an issue you need to go with RAID 6 or add another disk to each mirror. >> >> Mike Stone >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 6: explain analyze is your friend >> > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org >
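As a sanity check on the block counts in David's correction, the following short Python sketch (my own illustration, not from the thread; drive counts and block totals are arbitrary examples) counts physical blocks written per batch of logical blocks for best-case RAID 5 (full-stripe writes, one parity block per stripe) versus RAID 1/10 mirroring.

# Best-case physical blocks written for N logical data blocks.
# RAID 5: one parity block per full stripe; RAID 1/10: every block mirrored.
def raid5_blocks_written(data_blocks, total_drives):
    data_per_stripe = total_drives - 1              # n data + 1 parity
    stripes = -(-data_blocks // data_per_stripe)    # ceiling division
    return data_blocks + stripes

def raid10_blocks_written(data_blocks):
    return 2 * data_blocks

logical = 1000
for drives in (4, 8, 16):
    print("%2d drives: RAID5 writes %d blocks, RAID10 writes %d blocks for %d logical blocks"
          % (drives, raid5_blocks_written(logical, drives),
             raid10_blocks_written(logical), logical))

For the 16-drive case this comes out to roughly 1067 blocks written for RAID 5 versus 2000 for RAID 10, which is the "n+1 versus 2*n" point above; the read-before-write cost of partial-stripe updates is a separate issue.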
Have you had any experience rebuilding arrays in Linux using the 3Ware utilities? If so, did it work well?
Luke Lonergan wrote:
Benjamin, > Have you done any benchmarking of the 9550SX against a software raid configuration? Interesting - no, not on SATA, mostly because I've had awful luck with Linux drivers and SATA. The popular manufacturers of SATA to PCI bridge chipsets are Silicon Image and Highpoint, and I've not seen Linux work with them at any reasonable performance yet. I've also had problems with Adaptec's cards - I think they manufacture their own SATA to PCI chipset as well. So far, I've only had good luck with the on-chipset Intel SATA implementation. I think the problems I've had could be entirely driver-related, but in the end it doesn't matter if you can't find drivers that work for Linux. The other problem is getting enough SATA connections for the number of disks we want. I do have two new Areca SATA RAID cards and I'm going to benchmark those against the 3Ware 9550SX with 2 x 8 = 16 disks on one host. I guess we could run the HW RAID controllers in JBOD mode to get a good driver / chipset configuration for software RAID, but frankly I prefer HW RAID if it performs well. So far the SATA host-based RAID is blowing the doors off of every other HW RAID solution I've tested. - Luke
Benjamin, On 12/26/05 10:21 AM, "Benjamin Arai" <barai@cs.ucr.edu> wrote: > Have you have any experience rebuilding arrays in linux using the 3Ware > utilities? If so, did it work well? Sure we have - nowadays with disks failing as much as they do how could we not? ;-) 3Ware has some *nice* tools - including a web browser utility for managing the RAID. Rebuilds have been super easy - and the e-mail notification is fine. They even have some decent migration options. What they don't have are tools like snapshot backup, like EMC has, or SRDF or any of the enterprise SAN features. We don't need them because Bizgres MPP takes care of the need in software, but some people have become accustomed to the features for other uses. We're pretty happy with 3Ware, but their new 9550SX is, well, new. We managed to find a good enough combination of driver and firmware to make it work well on CentOs 4.1 and that's good enough for us, but there are definitely some issues with some combinations now. Lastly, you do have to set the block device readahead to 16MB to get performance. - Luke
Yes - they work excellently. I have several medium and large servers running 3ware 9500S series cards with great success. We have rebuilt many failed RAID 10s over the years with no problems. Alex On 12/26/05, Benjamin Arai <barai@cs.ucr.edu> wrote: > Have you have any experience rebuilding arrays in linux using the 3Ware > utilities? If so, did it work well? > > > Luke Lonergan wrote: > Benjamin, > > > > Have you done any benchmarking of the 9550SX against a software raid > configuration? > > > Interesting - no, not on SATA, mostly because I've had awful luck with Linux > drivers and SATA. The popular manufacturers of SATA to PCI bridge chipsets > are Silicon Image and Highpoint, and I've not seen Linux work with them at > any reasonable performance yet. I've also had problems with Adaptec's cards > - I think they manufacture their own SATA to PCI chipset as well. So far, > I've only had good luck with the on-chipset Intel SATA implementation. I > think the problems I've had could be entirely driver-related, but in the end > it doesn't matter if you can't find drivers that work for Linux. > > The other problem is getting enough SATA connections for the number of disks > we want. I do have two new Areca SATA RAID cards and I'm going to benchmark > those against the 3Ware 9550SX with 2 x 8 = 16 disks on one host. > > I guess we could run the HW RAID controllers in JBOD mode to get a good > driver / chipset configuration for software RAID, but frankly I prefer HW > RAID if it performs well. So far the SATA host-based RAID is blowing the > doors off of every other HW RAID solution I've tested. > > - Luke > > > > > -- > Benjamin Arai > barai@cs.ucr.edu > http://www.benjaminarai.com
Yes, but those blocks in RAID 10 are largely irrelevant as they are to independent disks. In RAID 5 you have to write parity to an 'active' drive that is part of the stripe. (They are irrelevant unless of course you are maxing out your SCSI bus - yet another reason why SATA can be faster than SCSI, particularly in RAID 10, where every channel is independent.) Sorry - my math for RAID 5 was a bit off - I don't know why I was considering only a three drive situation, which is the worst. It's n+1, you are right. Still, for small arrays that's a big penalty, and there is definitely a penalty, contrary to the assertion of the original poster. I agree totally that the read+parity-calc+write in the worst case is totally bad, which is why I always recommend people should _never ever_ use RAID 5. In this day and age of large capacity chassis, and large capacity SATA drives, RAID 5 is totally inappropriate IMHO for _any_ application, least of all databases. In reality I have yet to benchmark a system where RAID 5 on the same number of drives with 8 drives or less in a single array beat a RAID 10 with the same number of drives. I would definitely be interested in a SCSI card that could actually achieve the theoretical performance of RAID 5, especially under Linux. With RAID 5 you get to watch your system crumble and fail when a drive fails and the array goes into a failed state. It's just not worth it. Alex. On 12/26/05, David Lang <dlang@invendra.net> wrote: > On Mon, 26 Dec 2005, Alex Turner wrote: > > > It's irrelavent what controller, you still have to actualy write the > > parity blocks, which slows down your write speed because you have to > > write n+n/2 blocks. instead of just n blocks making the system write > > 50% more data. > > > > RAID 5 must write 50% more data to disk therefore it will always be slower. > > raid5 writes n+1 blocks not n+n/2 (unless n=2 for a 3-disk raid). you can > have a 15+1 disk raid5 array for example > > however raid1 (and raid10) have to write 2*n blocks to disk. so if you are > talking about pure I/O needed raid5 wins hands down. (the same 16 drives > would be a 8+8 array) > > what slows down raid 5 is that to modify a block you have to read blocks > from all your drives to re-calculate the parity. this interleaving of > reads and writes when all you are logicly doing is writes can really hurt. > (this is why I asked the question that got us off on this tangent, when > doing new writes to an array you don't have to read the blocks as they are > blank, assuming your cacheing is enough so that you can write blocksize*n > before the system starts actually writing the data) > > David Lang > > > Alex. > > > > On 12/25/05, Michael Stone <mstone+postgres@mathom.us> wrote: > >> On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: > >>> Caches help, and the bigger the cache the better, but once you are > >>> doing enough writes fast enough (and that doesn't take much even with > >>> a few GBs of cache) the recalculate-checksums-and-write-new-ones > >>> overhead will decrease the write speed of real data. Bear in mind > >>> that the HD's _raw_ write speed hasn't been decreased. Those HD's > >>> are pounding away as fast as they can for you. Your _effective_ or > >>> _data level_ write speed is what decreases due to overhead. > >> > >> You're overgeneralizing. Assuming a large cache and a sequential write, > >> there's need be no penalty for raid 5. 
(For random writes you may > >> need to read unrelated blocks in order to calculate parity, but for > >> large sequential writes the parity blocks should all be read from > >> cache.) A modern cpu can calculate parity for raid 5 on the order of > >> gigabytes per second, and even crummy embedded processors can do > >> hundreds of megabytes per second. You may have run into some lousy > >> implementations, but you should be much more specific about what > >> hardware you're talking about instead of making sweeping > >> generalizations. > >> > >>> Side Note: people often forget the other big reason to use RAID 10 > >>> over RAID 5. RAID 5 is always only 2 HD failures from data > >>> loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss > >>> unless you get unlucky and lose both members of a RAID 1 set. > >> > >> IOW, your RAID 10 is only 2 HD failures from data loss also. If that's > >> an issue you need to go with RAID 6 or add another disk to each mirror. > >> > >> Mike Stone > >> > >> ---------------------------(end of broadcast)--------------------------- > >> TIP 6: explain analyze is your friend > >> > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 4: Have you searched our list archives? > > > > http://archives.postgresql.org > > >
On Mon, 26 Dec 2005, Alex Turner wrote: > > Yes, but those blocks in RAID 10 are largely irrelevant as they are to > independant disks. In RAID 5 you have to write parity to an 'active' > drive that is part of the stripe. (They are irrelevant unless of > course you are maxing out your SCSI bus - yet another reason why SATA > can be faster than SCSI, particularly in RAID 10, every channel is > independant). I don't understand your 'active' vs 'inactive' drive argument; in raid 1 or 1+0 all drives are active. With good components you need to worry about maxing out your PCI bus as much as any other one (this type of thing is where the hardware raid has a definite advantage, since the card handles the extra I/O, not your system). > Sorry - my math for RAID 5 was a bit off - I don't know why I was > considering only a three dirve situation - which is the worst. It's > n+1 you are right. still, for small arrays thats a big penalty. > Still, there is definately a penatly contrary to the assertion of the > orignal poster. > > I agree totally that the read+parity-calc+write in the worst case is > totaly bad, which is why I alway recommend people should _never ever_ > use RAID 5. In this day and age of large capacity chassis, and large > capacity SATA drives, RAID 5 is totally inapropriate IMHO for _any_ > application least of all databases. > > In reality I have yet to benchmark a system where RAID 5 on the same > number of drives with 8 drives or less in a single array beat a RAID > 10 with the same number of drives. I would definately be interested > in a SCSI card that could actualy achieve the theoretical performance > of RAID 5 especially under Linux. But it's not a 'same number of drives' comparison you should be making: if you have an 8 drive RAID5 array you need to compare it with a 14 drive RAID1/10 array. > With RAID 5 you get to watch you system crumble and fail when a drive > fails and the array goes into a failed state. It's just not worth it. Speed is worth money (and therefore number of drives) in some cases, but not in all cases. Also, the speed penalty when you have a raid drive fail varies based on your controller. It's wrong to flatly rule out any RAID configuration; they all have their place, and the important thing is to understand what the advantages and disadvantages are for each of them so you can know when to use each one. For example, I have a situation I am looking at where RAID0 is looking appropriate for a database (a multi-TB array that gets completely reloaded every month or so as data expires and new data is loaded from the authoritative source; adding another 16 drives to get redundancy isn't reasonable). David Lang > Alex. > > > On 12/26/05, David Lang <dlang@invendra.net> wrote: >> On Mon, 26 Dec 2005, Alex Turner wrote: >> >>> It's irrelavent what controller, you still have to actualy write the >>> parity blocks, which slows down your write speed because you have to >>> write n+n/2 blocks. instead of just n blocks making the system write >>> 50% more data. >>> >>> RAID 5 must write 50% more data to disk therefore it will always be slower. >> >> raid5 writes n+1 blocks not n+n/2 (unless n=2 for a 3-disk raid). you can >> have a 15+1 disk raid5 array for example >> >> however raid1 (and raid10) have to write 2*n blocks to disk. so if you are >> talking about pure I/O needed raid5 wins hands down. (the same 16 drives >> would be a 8+8 array) >> >> what slows down raid 5 is that to modify a block you have to read blocks >> from all your drives to re-calculate the parity. 
this interleaving of >> reads and writes when all you are logicly doing is writes can really hurt. >> (this is why I asked the question that got us off on this tangent, when >> doing new writes to an array you don't have to read the blocks as they are >> blank, assuming your cacheing is enough so that you can write blocksize*n >> before the system starts actually writing the data) >> >> David Lang >> >>> Alex. >>> >>> On 12/25/05, Michael Stone <mstone+postgres@mathom.us> wrote: >>>> On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: >>>>> Caches help, and the bigger the cache the better, but once you are >>>>> doing enough writes fast enough (and that doesn't take much even with >>>>> a few GBs of cache) the recalculate-checksums-and-write-new-ones >>>>> overhead will decrease the write speed of real data. Bear in mind >>>>> that the HD's _raw_ write speed hasn't been decreased. Those HD's >>>>> are pounding away as fast as they can for you. Your _effective_ or >>>>> _data level_ write speed is what decreases due to overhead. >>>> >>>> You're overgeneralizing. Assuming a large cache and a sequential write, >>>> there's need be no penalty for raid 5. (For random writes you may >>>> need to read unrelated blocks in order to calculate parity, but for >>>> large sequential writes the parity blocks should all be read from >>>> cache.) A modern cpu can calculate parity for raid 5 on the order of >>>> gigabytes per second, and even crummy embedded processors can do >>>> hundreds of megabytes per second. You may have run into some lousy >>>> implementations, but you should be much more specific about what >>>> hardware you're talking about instead of making sweeping >>>> generalizations. >>>> >>>>> Side Note: people often forget the other big reason to use RAID 10 >>>>> over RAID 5. RAID 5 is always only 2 HD failures from data >>>>> loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss >>>>> unless you get unlucky and lose both members of a RAID 1 set. >>>> >>>> IOW, your RAID 10 is only 2 HD failures from data loss also. If that's >>>> an issue you need to go with RAID 6 or add another disk to each mirror. >>>> >>>> Mike Stone >>>> >>>> ---------------------------(end of broadcast)--------------------------- >>>> TIP 6: explain analyze is your friend >>>> >>> >>> ---------------------------(end of broadcast)--------------------------- >>> TIP 4: Have you searched our list archives? >>> >>> http://archives.postgresql.org >>> >> >
On 12/26/05, David Lang <dlang@invendra.net> wrote: > raid5 writes n+1 blocks not n+n/2 (unless n=2 for a 3-disk raid). you can > have a 15+1 disk raid5 array for example > > however raid1 (and raid10) have to write 2*n blocks to disk. so if you are > talking about pure I/O needed raid5 wins hands down. (the same 16 drives > would be a 8+8 array) > > what slows down raid 5 is that to modify a block you have to read blocks > from all your drives to re-calculate the parity. this interleaving of > reads and writes when all you are logicly doing is writes can really hurt. > (this is why I asked the question that got us off on this tangent, when > doing new writes to an array you don't have to read the blocks as they are > blank, assuming your cacheing is enough so that you can write blocksize*n > before the system starts actually writing the data) Not exactly true. Let's assume you have a 4+1 RAID5 (drives A, B, C, D and E), and you want to update drive A. Let's assume the parity for this particular write is stored on drive E. One way to write it is: write A; read A, B, C, D; combine A+B+C+D and write it to E (4 reads + 2 writes). The other way to write it is: read oldA, read old parity oldE; write newA, write E = oldE + (newA-oldA) -- calculate the difference between new and old A, apply it to the old parity, then write (2 reads + 2 writes). The more drives you have, the smarter it is to use the second approach, unless of course A, B, C and D are available in the cache, which is the nicest situation. Regards, Dawid
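A quick way to see where the crossover between Dawid's two strategies lies is to count the I/Os for each as a function of the number of data drives in the stripe. The short Python sketch below (illustrative bookkeeping only, assuming nothing useful is already in the controller cache) does just that.

# I/O cost of updating one data block in an (n data + 1 parity) RAID 5 stripe.
def reconstruct_write_ios(n_data):
    # read the other data blocks in the stripe, then write new data + new parity
    return {"reads": n_data - 1, "writes": 2}

def read_modify_write_ios(n_data):
    # read old data + old parity, then write new data + new parity
    return {"reads": 2, "writes": 2}

for n in (3, 4, 8, 15):
    print("n=%2d  reconstruct: %s  incremental: %s"
          % (n, reconstruct_write_ios(n), read_modify_write_ios(n)))

Beyond three or four data drives the incremental (read-modify-write) path clearly wins, which matches the observation that the more drives you have, the smarter the second approach becomes.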
On Mon, Dec 26, 2005 at 12:32:19PM -0500, Alex Turner wrote: >It's irrelavent what controller, you still have to actualy write the >parity blocks, which slows down your write speed because you have to >write n+n/2 blocks. instead of just n blocks making the system write >50% more data. > >RAID 5 must write 50% more data to disk therefore it will always be >slower. At this point you've drifted into complete nonsense mode. On Mon, Dec 26, 2005 at 10:11:00AM -0800, David Lang wrote: >what slows down raid 5 is that to modify a block you have to read blocks >from all your drives to re-calculate the parity. this interleaving of >reads and writes when all you are logicly doing is writes can really hurt. >(this is why I asked the question that got us off on this tangent, when >doing new writes to an array you don't have to read the blocks as they are >blank, assuming your cacheing is enough so that you can write blocksize*n >before the system starts actually writing the data) Correct; there's no reason for the controller to read anything back if your write will fill a complete stripe. That's why I said that there isn't a "RAID 5 penalty" assuming you've got a reasonably fast controller and you're doing large sequential writes (or have enough cache that random writes can be batched as large sequential writes). On Mon, Dec 26, 2005 at 06:04:40PM -0500, Alex Turner wrote: >Yes, but those blocks in RAID 10 are largely irrelevant as they are to >independant disks. In RAID 5 you have to write parity to an 'active' >drive that is part of the stripe. Once again, this doesn't make any sense. Can you explain which parts of a RAID 10 array are inactive? >I agree totally that the read+parity-calc+write in the worst case is >totaly bad, which is why I alway recommend people should _never ever_ >use RAID 5. In this day and age of large capacity chassis, and large >capacity SATA drives, RAID 5 is totally inapropriate IMHO for _any_ >application least of all databases. So I've got a 14 drive chassis full of 300G SATA disks and need at least 3.5TB of data storage. In your mind the only possible solution is to buy another 14 drive chassis? Must be nice to never have a budget. Must be a hard sell if you've bought decent enough hardware that your benchmarks can't demonstrate a difference between a RAID 5 and a RAID 10 configuration on that chassis except in degraded mode (and the customer doesn't want to pay double for degraded mode performance). >In reality I have yet to benchmark a system where RAID 5 on the same >number of drives with 8 drives or less in a single array beat a RAID >10 with the same number of drives. Well, those are frankly little arrays, probably on lousy controllers... Mike Stone
At 08:35 AM 12/27/2005, Michael Stone wrote: >On Mon, Dec 26, 2005 at 10:11:00AM -0800, David Lang wrote: >>what slows down raid 5 is that to modify a block you have to read >>blocks from all your drives to re-calculate the parity. this >>interleaving of reads and writes when all you are logicly doing is >>writes can really hurt. (this is why I asked the question that got >>us off on this tangent, when doing new writes to an array you don't >>have to read the blocks as they are blank, assuming your cacheing >>is enough so that you can write blocksize*n before the system >>starts actually writing the data) > >Correct; there's no reason for the controller to read anything back >if your write will fill a complete stripe. That's why I said that >there isn't a "RAID 5 penalty" assuming you've got a reasonably fast >controller and you're doing large sequential writes (or have enough >cache that random writes can be batched as large sequential writes). Sorry. A decade+ RWE in production with RAID 5 using controllers as bad as Adaptec and as good as Mylex, Chaparral, LSI Logic (including their Engino stuff), and Xyratex under 5 different OS's (Sun, Linux, M$, DEC, HP) on each of Oracle, SQL Server, DB2, mySQL, and pg shows that RAID 5 writes are slower than RAID 5 reads With the one notable exception of the Mylex controller that was so good IBM bought Mylex to put them out of business. Enough IO load, random or sequential, will cause the effect no matter how much cache you have or how fast the controller is. The even bigger problem that everyone is ignoring here is that large RAID 5's spend increasingly larger percentages of their time with 1 failed HD in them. The math of having that many HDs operating simultaneously 24x7 makes it inevitable. This means you are operating in degraded mode an increasingly larger percentage of the time under exactly the circumstance you least want to be. In addition, you are =one= HD failure from data loss on that array an increasingly larger percentage of the time under exactly the least circumstances you want to be. RAID 5 is not a silver bullet. > On Mon, Dec 26, 2005 at 06:04:40PM -0500, Alex Turner wrote: >>Yes, but those blocks in RAID 10 are largely irrelevant as they are >>to independant disks. In RAID 5 you have to write parity to an >>'active' drive that is part of the stripe. > >Once again, this doesn't make any sense. Can you explain which parts of >a RAID 10 array are inactive? > >>I agree totally that the read+parity-calc+write in the worst case >>is totaly bad, which is why I alway recommend people should _never >>ever_ use RAID 5. In this day and age of large capacity chassis, >>and large capacity SATA drives, RAID 5 is totally inapropriate IMHO >>for _any_ application least of all databases. I vote with Michael here. This is an extreme position to take that can't be followed under many circumstances ITRW. >So I've got a 14 drive chassis full of 300G SATA disks and need at >least 3.5TB of data storage. In your mind the only possible solution >is to buy another 14 drive chassis? Must be nice to never have a budget. I think you mean an infinite budget. That's even assuming it's possible to get the HD's you need. I've had arrays that used all the space I could give them in 160 HD cabinets. Two 160 HD cabinets was neither within the budget nor going to perform well. I =had= to use RAID 5. RAID 10 was just not usage efficient enough. 
>Must be a hard sell if you've bought decent enough hardware that >your benchmarks can't demonstrate a difference between a RAID 5 and >a RAID 10 configuration on that chassis except in degraded mode (and >the customer doesn't want to pay double for degraded mode performance) I have =never= had this situation. RAID 10 latency is better than RAID 5 latency. RAID 10 write speed under heavy enough load, of any type, is faster than RAID 5 write speed under the same circumstances. RAID 10 robustness is better as well. Problem is that sometimes budget limits or number of HDs needed limits mean you can't use RAID 10. >>In reality I have yet to benchmark a system where RAID 5 on the >>same number of drives with 8 drives or less in a single array beat >>a RAID 10 with the same number of drives. > >Well, those are frankly little arrays, probably on lousy controllers... Nah. Regardless of controller I can take any RAID 5 and any RAID 10 built on the same HW under the same OS running the same DBMS and =guarantee= there is an IO load above which it can be shown that the RAID 10 will do writes faster than the RAID 5. The only exception in my career thus far has been the aforementioned Mylex controller. OTOH, sometimes you have no choice but to "take the hit" and use RAID 5. cheers, Ron
Historically, I have heard that RAID5 is only faster than RAID10 if there are six or more drives. --------------------------------------------------------------------------- Ron wrote: > At 08:35 AM 12/27/2005, Michael Stone wrote: > >On Mon, Dec 26, 2005 at 10:11:00AM -0800, David Lang wrote: > >>what slows down raid 5 is that to modify a block you have to read > >>blocks from all your drives to re-calculate the parity. this > >>interleaving of reads and writes when all you are logicly doing is > >>writes can really hurt. (this is why I asked the question that got > >>us off on this tangent, when doing new writes to an array you don't > >>have to read the blocks as they are blank, assuming your cacheing > >>is enough so that you can write blocksize*n before the system > >>starts actually writing the data) > > > >Correct; there's no reason for the controller to read anything back > >if your write will fill a complete stripe. That's why I said that > >there isn't a "RAID 5 penalty" assuming you've got a reasonably fast > >controller and you're doing large sequential writes (or have enough > >cache that random writes can be batched as large sequential writes). > > Sorry. A decade+ RWE in production with RAID 5 using controllers as > bad as Adaptec and as good as Mylex, Chaparral, LSI Logic (including > their Engino stuff), and Xyratex under 5 different OS's (Sun, Linux, > M$, DEC, HP) on each of Oracle, SQL Server, DB2, mySQL, and pg shows > that RAID 5 writes are slower than RAID 5 reads > > With the one notable exception of the Mylex controller that was so > good IBM bought Mylex to put them out of business. > > Enough IO load, random or sequential, will cause the effect no matter > how much cache you have or how fast the controller is. > > The even bigger problem that everyone is ignoring here is that large > RAID 5's spend increasingly larger percentages of their time with 1 > failed HD in them. The math of having that many HDs operating > simultaneously 24x7 makes it inevitable. > > This means you are operating in degraded mode an increasingly larger > percentage of the time under exactly the circumstance you least want > to be. In addition, you are =one= HD failure from data loss on that > array an increasingly larger percentage of the time under exactly the > least circumstances you want to be. > > RAID 5 is not a silver bullet. > > > > On Mon, Dec 26, 2005 at 06:04:40PM -0500, Alex Turner wrote: > >>Yes, but those blocks in RAID 10 are largely irrelevant as they are > >>to independant disks. In RAID 5 you have to write parity to an > >>'active' drive that is part of the stripe. > > > >Once again, this doesn't make any sense. Can you explain which parts of > >a RAID 10 array are inactive? > > > >>I agree totally that the read+parity-calc+write in the worst case > >>is totaly bad, which is why I alway recommend people should _never > >>ever_ use RAID 5. In this day and age of large capacity chassis, > >>and large capacity SATA drives, RAID 5 is totally inapropriate IMHO > >>for _any_ application least of all databases. > I vote with Michael here. This is an extreme position to take that > can't be followed under many circumstances ITRW. > > > >So I've got a 14 drive chassis full of 300G SATA disks and need at > >least 3.5TB of data storage. In your mind the only possible solution > >is to buy another 14 drive chassis? Must be nice to never have a budget. > > I think you mean an infinite budget. That's even assuming it's > possible to get the HD's you need. 
I've had arrays that used all the > space I could give them in 160 HD cabinets. Two 160 HD cabinets was > neither within the budget nor going to perform well. I =had= to use > RAID 5. RAID 10 was just not usage efficient enough. > > > >Must be a hard sell if you've bought decent enough hardware that > >your benchmarks can't demonstrate a difference between a RAID 5 and > >a RAID 10 configuration on that chassis except in degraded mode (and > >the customer doesn't want to pay double for degraded mode performance) > > I have =never= had this situation. RAID 10 latency is better than > RAID 5 latency. RAID 10 write speed under heavy enough load, of any > type, is faster than RAID 5 write speed under the same > circumstances. RAID 10 robustness is better as well. > > Problem is that sometimes budget limits or number of HDs needed > limits mean you can't use RAID 10. > > > >>In reality I have yet to benchmark a system where RAID 5 on the > >>same number of drives with 8 drives or less in a single array beat > >>a RAID 10 with the same number of drives. > > > >Well, those are frankly little arrays, probably on lousy controllers... > Nah. Regardless of controller I can take any RAID 5 and any RAID 10 > built on the same HW under the same OS running the same DBMS and > =guarantee= there is an IO load above which it can be shown that the > RAID 10 will do writes faster than the RAID 5. The only exception in > my career thus far has been the aforementioned Mylex controller. > > OTOH, sometimes you have no choice but to "take the hit" and use RAID 5. > > > cheers, > Ron > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Tue, Dec 27, 2005 at 11:50:16AM -0500, Ron wrote: >Sorry. A decade+ RWE in production with RAID 5 using controllers as >bad as Adaptec and as good as Mylex, Chaparral, LSI Logic (including >their Engino stuff), and Xyratex under 5 different OS's (Sun, Linux, >M$, DEC, HP) on each of Oracle, SQL Server, DB2, mySQL, and pg shows >that RAID 5 writes are slower than RAID 5 reads What does that have to do with anything? That wasn't the question... >RAID 5 is not a silver bullet. Who said it was? Nothing is, not even RAID 10. The appropriate thing to do is to make decisions based on requirements, not to make sweeping statements that eliminate entire categories of solutions based on hand waving. Mike Stone
Bruce, On 12/27/05 9:51 AM, "Bruce Momjian" <pgman@candle.pha.pa.us> wrote: > Historically, I have heard that RAID5 is only faster than RAID10 if > there are six or more drives. I think the real question here is "faster for what?" Also, just like the optimizer tunables for cpu/disk/memory speed relationships, the standing guidance for RAID has become outdated. Couple that with the predominance of really bad hardware RAID controllers and people not testing them or reporting their performance (HP, Adaptec, LSI, Dell) and we've got a mess. All we can really do is report success with various point solutions. RAID5 and RAID50 work fine for our customers who do OLAP type applications which are read-mostly. However, it only works well on good hardware and software, which at this time include the HW RAID controllers from 3Ware and reputedly Areca and SW using Linux SW RAID. I've heard that the external storage RAID controllers from EMC work well, and I'd suspect there are others, but none of the host-based SCSI HW RAID controllers I've tested work well on Linux. I say Linux, because I'm pretty sure that the HP smartarray controllers work well on Windows, but the Linux driver is so bad I'd say it doesn't work at all. WRT RAID10, it seems like throwing double the number of disks at the problems is something to be avoided if possible, though the random write performance may be important for OLTP. I think this assertion should be retested however in light of the increased speed of checksumming hardware and / or CPUs and faster, more effective drive electronics (write combining, write cache, etc). - Luke
Bruce, On 12/27/05 9:51 AM, "Bruce Momjian" <pgman@candle.pha.pa.us> wrote: > Historically, I have heard that RAID5 is only faster than RAID10 if > there are six or more drives. Speaking of testing / proof, check this site out: http://www.wlug.org.nz/HarddiskBenchmarks I really like the idea - post your bonnie++ results so people can learn from your configurations. We've built a performance reporting site, but we can't seem to get it into shape for release. I'd really like to light a performance leaderboard / experiences site up somewhere... - Luke
At 02:05 PM 12/27/2005, Michael Stone wrote: >On Tue, Dec 27, 2005 at 11:50:16AM -0500, Ron wrote: >>Sorry. A decade+ RWE in production with RAID 5 using controllers >>as bad as Adaptec and as good as Mylex, Chaparral, LSI Logic >>(including their Engino stuff), and Xyratex under 5 different OS's >>(Sun, Linux, M$, DEC, HP) on each of Oracle, SQL Server, DB2, >>mySQL, and pg shows that RAID 5 writes are slower than RAID 5 reads > >What does that have to do with anything? That wasn't the question... Your quoted position is "there isn't a 'RAID 5 penalty' assuming you've got a reasonably fast controller and you're doing large sequential writes (or have enough cache that random writes can be batched as large sequential writes)." My experience across a wide range of HW, OSs, DBMS, and applications says you are wrong. Given enough IO, RAID 5 takes a bigger performance hit for writes than RAID 10 does. Enough IO, sequential or otherwise, will result in a situation where a RAID 10 array using the same number of HDs (and therefore of ~1/2 the usable capacity) will have better write performance than the equivalent RAID 5 built using the same number of HDs. There is a 'RAID 5 write penalty'. Said RAID 10 array will also be more robust than a RAID 5 built using the same number of HDs. OTOH, that does not make RAID 5 "bad". Nor are statements like "Never use RAID 5!" realistic or reasonable. Also, performance is not the only or even most important reason for choosing RAID 10 or RAID 50 over RAID 5. Robustness considerations can be more important than performance ones. cheers, Ron
On Tue, Dec 27, 2005 at 02:57:13PM -0500, Ron wrote: >Your quoted position is "there isn't a 'RAID 5 penalty' assuming >you've got a reasonably fast controller and you're doing large >sequential writes (or have enough cache that random writes can be >batched as large sequential writes)." And you said that RAID 5 writes are slower than reads. That's a completely different statement. The traditional meaning of "RAID 5 penalty" is the cost of reading a stripe to calculate parity if only a small part of the stripe changes. It has a special name because it can result in a condition where the performance is catastrophically worse than an optimal workload, or even the single-disk non-RAID case. It's still an issue, but might not be relevant for a particular workload. (Hence the recommendation to benchmark.) >My experience across a wide range of HW, OSs, DBMS, and applications >says you are wrong. Given enough IO, RAID 5 takes a bigger >performance hit for writes than RAID 10 does. I don't understand why you keep using the pejorative term "performance hit". Try describing the "performance characteristics" instead. Also, performance claims based on experience are fairly useless. Either you have data to provide (in which case claiming vast experience is unnecessary) or you don't. >Said RAID 10 array will also be more robust than a RAID 5 built using >the same number of HDs. And a RAID 6 will be more robust than either. Basing reliability on "hopefully you won't have both disks in a mirror fail" is just silly. Either you need double disk failure protection or you don't. Mike Stone
At 04:15 PM 12/27/2005, Michael Stone wrote: >I don't understand why you keep using the pejorative term "performance >hit". Try describing the "performance characteristics" instead. pe·jor·a·tive, adj.: Tending to make or become worse. Disparaging; belittling. RAID 5 write performance is sufficiently lower than RAID 5 read performance to be a matter of professional note and concern. That's not "disparaging or belittling" nor is it "tending to make or become worse". It's a measurable fact that has an adverse impact on capacity planning, budgeting, HW deployment, etc. If you consider calling a provable decrease in performance while doing a certain task that has such effects "a hit" or "bad" pejorative, you are using a definition for the word that is different than the standard one. >Also, performance claims based on experience are fairly useless. >Either you have data to provide (in which case claiming vast experience >is unnecessary) or you don't. My experience _is_ the data provided. Isn't it convenient for you that I don't have the records for every job I've done in 20 years, nor do I necessarily have the right to release some specifics for some of what I do have. I've said what I can as a service to the community. Including to you. Your reaction implies that I and others with perhaps equally or more valuable experience to share shouldn't bother. "One of the major differences between Man and Beast is that Man learns from others' experience." It's also impressive that you evidently seem to be implying that you keep such records for your own job experience _and_ that you have the legal right to publish them. In which case, please feel free to impress me further by doing so. >>Said RAID 10 array will also be more robust >>than a RAID 5 built using the same number of HDs. > >And a RAID 6 will be more robust than either. Basing reliability on >"hopefully you won't have both disks in a mirror fail" is just silly. >Either you need double disk failure protection or you don't. That statement is incorrect and ignores both probability and real world statistical failure patterns. The odds of a RAID 10 array of n HDs suffering a failure that loses data are less than the odds of it happening in a RAID 6 array of n HDs. You are correct that RAID 6 is more robust than RAID 5. cheers, Ron
Hi, William, William Yu wrote: > Random write performance (small block that only writes to 1 drive): > 1 write requires N-1 reads + N writes --> 1/2N-1 % This is not true. Most Raid-5 engines use XOR or similar checksum methods. As opposed to cryptographic checksums, those can be updated and corrected incrementally. check_new = check_old xor data_old xor data_new So 2 reads and 2 writes are enough: read data and checksum, then adjust the checksum via the data difference, and write data and new checksum. And often, the old data block still is in cache, accounting to 1 read and two writes. HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
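Markus's incremental update rule is easy to check numerically. Here is a tiny Python sketch (illustrative only; block and stripe sizes are arbitrary) verifying that adjusting the old parity with the old and new contents of one block gives the same result as recomputing XOR parity over the whole stripe.

import random
from functools import reduce

def xor_blocks(a, b):
    # byte-wise XOR of two equal-length blocks
    return bytes(x ^ y for x, y in zip(a, b))

# five data blocks of 16 bytes each, plus their XOR parity
stripe = [bytes(random.randrange(256) for _ in range(16)) for _ in range(5)]
parity_old = reduce(xor_blocks, stripe)

# overwrite one block and update the parity incrementally:
# check_new = check_old xor data_old xor data_new
data_old = stripe[2]
data_new = bytes(random.randrange(256) for _ in range(16))
stripe[2] = data_new

parity_incremental = xor_blocks(xor_blocks(parity_old, data_old), data_new)
parity_recomputed = reduce(xor_blocks, stripe)
assert parity_incremental == parity_recomputed
print("incremental parity update matches full recomputation")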