Thread: Hardware needed for 15,000,000 record DB?
Hello, :-) Can someone help me spec out the hardware needed for a simple web-based database app? Basically, that application needs to lookup a single row by its primary key. This would be fairly straightforward, except that the table needs to contain up to 15 million records. Each row will contain approximately 25 variable length CHAR fields with perhaps a total of 3 to 4 kilobytes of data per row. Updates will be done nightly via some sort of a batch process. What kind of hardware would be needed for this sort of application? The queries are not complex, it's just a lot of data. Would a midrange Celeron processor with 256 MB RAM be sufficient? How would backups work for a database this large? Is PostgreSQL even the best database engine for this app? Perhaps MySQL? Or maybe a Microsoft solution? Thanks so much, David
pdg@stratos.net wrote: > Hello, :-) > > Can someone help me spec out the hardware needed for a simple web-based > database app? > > Basically, that application needs to lookup a single row by its primary key. > This would be fairly straightforward, except that the table needs to contain > up to 15 million records. Each row will contain approximately 25 variable > length CHAR fields with perhaps a total of 3 to 4 kilobytes of data per row. > > Updates will be done nightly via some sort of a batch process. > > What kind of hardware would be needed for this sort of application? The > queries are not complex, it's just a lot of data. Databases are I/O bound...if your I/O is slow, your database will be slow. So your goal is to minimize the amount of I/O needed and the time it takes to do the I/O. You minimize the amount of I/O by getting things with big caches. You minimize the amount of time the I/O takes by using fast storage devices. This means SCSI. If you have limited funds, spend them on fast SCSI disks with big caches and a good controller instead the latest and greatest processor. If you have the funds, look into getting a RAID card with a big cache on it. A RAID 1 or 5 also helps out if a disk crashes. > Would a midrange Celeron processor with 256 MB RAM be sufficient? How would > backups work for a database this large? I would not use the Celeron. The Celeron lacks cache, and you want all the cache you can afford. One user posted benchmarks to this list showing the Celeron to be twice as slow as a PIII at the same clock speed. Buy a lot of memory. And make sure the motherboard you get can support a lot of memory (i.e., it has 4 or more slots for RAM). Backups: DDS3/4 tape is a good, but expensive choice. It's one of the few backup technologies that has cheap media and is vendor-neutral. Stay away from "Big floppies" like Jaz drives. They're too flaky. > Is PostgreSQL even the best database engine for this app? Perhaps MySQL? Or > maybe a Microsoft solution? MySQL is traditionally faster at pumping out web pages with pure speed, but PostgreSQL has been catching up very quickly. Also, PostgreSQL has traditionally been able to handle many more concurrent users, but I think MySQL has been getting better there, too. I haven't seen any banchmarks with recent versions of either database. They're both free, so try some tests with both of them (but don't forget to tune them properly). If you care anything about cost or flexibility, I wouldn't go for a Microsoft "solution". --Jeremy
On Mon, 22 Apr 2002, Jeremy Buchmann wrote: > Databases are I/O bound...if your I/O is slow, your database will be slow. > So your goal is to minimize the amount of I/O needed and the time it takes > to do the I/O. You minimize the amount of I/O by getting things with big > caches. Assuming you freqently access the same data. If you're doing essentially random queries on a 50 GB database, an extra few hundred megabytes of cache will probably make little difference. > You minimize the amount of time the I/O takes by using fast > storage devices. > This means SCSI. Not necessarially. More disk arms is also a big help, so much so that I would take two IDE drives (assuming that they're fast modern ones) over one SCSI drive any day. You probably want to at least make sure that your log files and data files are on separate disks. > If you have the funds, look into getting a RAID card with a big cache on it. > A RAID 1 or 5 also helps out if a disk crashes. A RAID 5 will slow down write performance considerably, but the added reliability might be worth the tradeoff. But mirroring is better, if you can afford it. Almost certainly you want your log file on a mirrored volume rather than on a RAID 5 volume. As far as a "big cache" goes, well, see above. Also note that if your OS caches blocks in system RAM, you will: a) have two caches in series, which is a bit of a waste, and b) n MB of cache on a raid controller will often be more expensive than the same amount as system RAM. > > Is PostgreSQL even the best database engine for this app? Perhaps MySQL? Or > > maybe a Microsoft solution? > > MySQL is traditionally faster at pumping out web pages with pure speed, but > PostgreSQL has been catching up very quickly. Also, PostgreSQL has > traditionally been able to handle many more concurrent users, but I think > MySQL has been getting better there, too. I haven't seen any banchmarks > with recent versions of either database. They're both free, so try some > tests with both of them (but don't forget to tune them properly). For this simple application, I'd say MySQL is likely to be a bit faster, simply because the database will be smaller. (MySQL has considerably less row overhead.) However, the speed difference is not likely to be too great, and certain other things might make MySQL even worse. > If you care anything about cost or flexibility, I wouldn't go for a > Microsoft "solution". Au contraire, MS SQL Server is cheap (compared to other commerical products) and is a pretty good database. I would happily use it again. But the fact that it runs under Windows only makes administration a pain, and "cheap" in the commerical world is still a heck of a lot more expensive than "free." However, given the size of your project, and what you're looking at spending, I can't see MS SQL Server or other products in that class being worthwhile. An SQL Server installation is typically a $10,000-$50,000 kind of thing. (You can do it for a lot less, but most such products that can get by with such small installations would probably be just as well off with PostgreSQL.) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote: > On Mon, 22 Apr 2002, Jeremy Buchmann wrote: > >>You minimize the amount of time the I/O takes by using fast >>storage devices. >>This means SCSI. > > > Not necessarially. More disk arms is also a big help, so much so > that I would take two IDE drives (assuming that they're fast modern > ones) over one SCSI drive any day. SCSI also has several other advantages...it doesn't suck up all your CPU time when there's a lot of disk activity, and it supports command queueing. IDE was designed to be cheap, SCSI was designed to be good. I've used both and from my experience, SCSI is The Way. --Jeremy
>>>>> "Jeremy" == Jeremy Buchmann <jeremy@wellsgaming.com> writes: Jeremy> Curt Sampson wrote: >> On Mon, 22 Apr 2002, Jeremy Buchmann wrote: >> >>> You minimize the amount of time the I/O takes by using fast >>> storage devices. This means SCSI. >> >> >> Not necessarially. More disk arms is also a big help, so much so >> that I would take two IDE drives (assuming that they're fast modern >> ones) over one SCSI drive any day. Jeremy> SCSI also has several other advantages...it doesn't suck up Jeremy> all your CPU time when there's a lot of disk activity, and it Jeremy> supports command queueing. IDE was designed to be cheap, SCSI Jeremy> was designed to be good. I've used both and from my Jeremy> experience, SCSI is The Way. Well... I wouldn't phrase the problem that 1-dimensionally. There are IDE drives that support queueing (IBM Deskstar something-or-other) and some OSs have ATA controller code that really kicks but (FreeBSD, for instance), so from a software point of view, IDE can work well. It's even likely that the maufacturing process is nearly identical for some makes and models. But the RMA rates for _extreme_ usage don't match. My theory: I find that IDE drives placed in situations where their request queue is populated completely for days at a time will fail between 3 and 6 months where SCSI drives can last years in this condition. Pay attention to the qualification: we're talking not even a milisecond of unused time for at least hours on end out of a day. I believe this is due to the SCSI disk having software that will momentarily stop processing requests to perform a recalibration ... where IDE drives may only do this when they are idle for some small period of time. Dave. -- ============================================================================ |David Gilbert, Velocet Communications. | Two things can only be | |Mail: dgilbert@velocet.net | equal if and only if they | |http://daveg.ca | are precisely opposite. | =========================================================GLO================
On Tue, Apr 23, 2002 at 01:44:40PM -0400, David Gilbert wrote: > I believe this is due to the SCSI disk having software that will > momentarily stop processing requests to perform a recalibration > ... where IDE drives may only do this when they are idle for some > small period of time. Another reason is that IDE drives often have higher densities. A 160 GB SCSI disk have a lot of plates (8?), while IDE drives with the same capacity typicly have only one or two. It basicly means that the disk is more sensitive to errors. (Numbers above may not be correct for all IDE and SCSI-disks, they are just meant as an illustration) This, like many other IDE/SCSI issues is not a IDE-problem per-say - it's just (like it was written earlier in the thread) that SCSI-disks are typicly designed for servers, while IDE disks are typicly designed for workstations. -- Ragnar Kjørstad Big Storage
In article <15557.40328.672259.475024@trooper.velocet.net>, David Gilbert <dgilbert@velocet.ca> wrote: >IDE drives that support queueing (IBM Deskstar something-or-other) and Aren't these the same drives that are dying left and right? mrc -- Mike Castle dalgoda@ix.netcom.com www.netcom.com/~dalgoda/ We are all of us living in the shadow of Manhattan. -- Watchmen fatal ("You are in a maze of twisty compiler features, all different"); -- gcc