Thread: Database storage
Hi -

I'm running Postgres v8.3 on Fedora 10 (home machine - x86_64
architecture). After realizing that the storage requirements of one of
my databases will exceed 4TB, I wanted to see if anyone had any
suggestions as to a hardware setup that works well with Postgres
running on Linux. I have partitioned most of the database, so older
data can go on slower devices, with only 400-500GB needed on faster
devices. Redundancy is a requirement. Database usage is generally low,
with bursty input. For the curious, here's the general profile:
    a) ~19,000,000 INSERT statements/day (190 INSERTs per commit
block), for a total of ~10GB of INSERTs/day
    b) INSERTs occur sequentially (daily batch)
    c) A handful of SELECT statements run per day to generate a few
dozen reports, but these are not resource-intensive.
    d) Only 3-5 database users configured.

Initial thought: use an external multi-bay eSATA case and insert some
7200RPM+, 32MB-cache, 1TB+ SATA drives running RAID 5.

A few specific questions:
    1) Which components generally lead to bottlenecks (controller
card, individual HD cache, HD rotational speed, other???)
    2) Are there better solutions (broad question in terms of price/
data availability)? Budget: $500-$750 for the storage medium. Better
question: which setup has worked well with a Postgres implementation
running on a similar budget?
    3) If using a similar setup, are there any particular
implementation issues that are not relatively obvious?

Thanks for the input (even if not Postgres-specific; I thought this
might be of interest to hobby database admins like myself running
Postgres implementations).
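P.S. In case the partition layout matters for hardware suggestions:
the old/new split is done with tablespaces on top of 8.3-style
inheritance partitioning, roughly like the sketch below (the table,
columns, and mount points are made up for illustration):

    -- Tablespaces map partitions onto the fast vs. slow devices
    -- (paths are hypothetical).
    CREATE TABLESPACE fast_disks LOCATION '/mnt/fast/pgdata';
    CREATE TABLESPACE slow_disks LOCATION '/mnt/slow/pgdata';

    -- 8.3-style partitioning via inheritance: the current month stays
    -- on the fast devices, older months live on the slow array.
    CREATE TABLE measurements (
        logdate  date NOT NULL,
        reading  numeric
    );
    CREATE TABLE measurements_2009_07 (
        CHECK (logdate >= DATE '2009-07-01' AND logdate < DATE '2009-08-01')
    ) INHERITS (measurements) TABLESPACE fast_disks;
    CREATE TABLE measurements_2009_06 (
        CHECK (logdate >= DATE '2009-06-01' AND logdate < DATE '2009-07-01')
    ) INHERITS (measurements) TABLESPACE slow_disks;

    -- Once a partition ages out, it can be migrated to the slow array:
    ALTER TABLE measurements_2009_07 SET TABLESPACE slow_disks;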
On Thu, Jul 9, 2009 at 5:40 PM,  <nabble.30.miller_2555@spamgourmet.com> wrote:
> Hi -
>
> I'm running Postgres v8.3 on Fedora 10 (home machine - x86_64
> architecture). After realizing that the storage requirements of one of
> my databases will exceed 4TB, I wanted to see if anyone had any
> suggestions as to a hardware setup that works well with Postgres
> running on Linux. I have partitioned most of the database, so older
> data can go on slower devices, with only 400-500GB needed on faster
> devices. Redundancy is a requirement. Database usage is generally low,
> with bursty input. For the curious, here's the general profile:
>    a) ~19,000,000 INSERT statements/day (190 INSERTs per commit
> block), for a total of ~10GB of INSERTs/day
>    b) INSERTs occur sequentially (daily batch)
>    c) A handful of SELECT statements run per day to generate a few
> dozen reports, but these are not resource-intensive.
>    d) Only 3-5 database users configured.
>
> Initial thought: use an external multi-bay eSATA case and insert some
> 7200RPM+, 32MB-cache, 1TB+ SATA drives running RAID 5.

While RAID-5 is generally not the best database setup, considering
that your inputs will be bulk, you're probably OK. Also, given that
2TB server-class drives (5400rpm) are now out, and that they have
about the same throughput as 1TB 7200RPM server-class drives, you
might want to look into them, as you can get by on fewer drives for
the same storage.

> A few specific questions:
>    1) Which components generally lead to bottlenecks (controller
> card, individual HD cache, HD rotational speed, other???)

IO is almost always your bottleneck on queries over large data sets.

>    2) Are there better solutions (broad question in terms of price/
> data availability)? Budget: $500-$750 for the storage medium. Better
> question: which setup has worked well with a Postgres implementation
> running on a similar budget?

$750 is about what a decent RAID controller would cost you, but given
your bulk-import scenario, you can probably do without one. In this
instance, you're probably better off with software RAID than with a
cheap RAID card, which will cost extra and probably be slower than
Linux software RAID.
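As an aside: the friendlier the load is to the disks, the less the
RAID level matters. Something along these lines (reusing the
illustrative measurements table from your sketch; the file name is
hypothetical) keeps the write pattern sequential, which is exactly the
case where RAID-5 tends to hold up:

    -- One big COPY per daily batch: sequential writes, one commit.
    BEGIN;
    COPY measurements (logdate, reading)
        FROM '/data/staging/daily_batch.csv' WITH CSV;
    COMMIT;

    -- With plain INSERTs, batching many rows per commit (as you
    -- already do) avoids paying one WAL flush per row:
    BEGIN;
    INSERT INTO measurements VALUES (DATE '2009-07-09', 1.0);
    -- ... remaining rows of the batch ...
    COMMIT;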
On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
> $750 is about what a decent RAID controller would cost you, but given
> your bulk-import scenario, you can probably do without one. In this
> instance, you're probably better off with software RAID than with a
> cheap RAID card, which will cost extra and probably be slower than
> Linux software RAID.

FWIW, the main disadvantage of software RAID is NOT speed -- Linux
software RAID is very fast. Aside from RAID-5, where it lets you
offload the parity calculation, there really isn't much speed benefit
to hardware RAID.

The main advantage of hardware RAID is the error handling. When you
get low-level errors or pull a drive, a lot of consumer-level
controllers and their drivers don't respond very well: they have long
timeouts or keep retrying, tragically unaware that the software RAID
would be able to handle the recovery. A good server-class RAID
controller should handle those situations without breaking a sweat.

--
greg
http://mit.edu/~gsstark/resume.pdf
On Thu, Jul 9, 2009 at 7:29 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>
>> $750 is about what a decent RAID controller would cost you, but given
>> your bulk-import scenario, you can probably do without one. In this
>> instance, you're probably better off with software RAID than with a
>> cheap RAID card, which will cost extra and probably be slower than
>> Linux software RAID.
>
> FWIW, the main disadvantage of software RAID is NOT speed -- Linux
> software RAID is very fast. Aside from RAID-5, where it lets you
> offload the parity calculation, there really isn't much speed benefit
> to hardware RAID.
>
> The main advantage of hardware RAID is the error handling. When you
> get low-level errors or pull a drive, a lot of consumer-level
> controllers and their drivers don't respond very well: they have long
> timeouts or keep retrying, tragically unaware that the software RAID
> would be able to handle the recovery. A good server-class RAID
> controller should handle those situations without breaking a sweat.

Definitely a big plus of a quality HW controller, and one of the
reasons I don't scrimp on the HW controllers I put in our 24/7
servers. OTOH, if you can afford a bit of downtime to handle failures,
Linux software RAID works pretty well, and since quad-core CPUs are
now pretty much the standard, it's OK if parity calculation uses up a
bit of one core on lower-performing servers like the reporting server
the OP was talking about.
> On Thu, Jul 9, 2009 at 7:29 PM, Greg Stark <gsstark@mit.edu> wrote:
>> On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>>
>>> $750 is about what a decent RAID controller would cost you, but given
>>> your bulk-import scenario, you can probably do without one. In this
>>> instance, you're probably better off with software RAID than with a
>>> cheap RAID card, which will cost extra and probably be slower than
>>> Linux software RAID.
...
>> The main advantage of hardware RAID is the error handling. When you
>> get low-level errors or pull a drive, a lot of consumer-level
>> controllers and their drivers don't respond very well: they have long
>> timeouts or keep retrying, tragically unaware that the software RAID
>> would be able to handle the recovery. A good server-class RAID
>> controller should handle those situations without breaking a sweat.

> Definitely a big plus of a quality HW controller, and one of the
> reasons I don't scrimp on the HW controllers I put in our 24/7
> servers. OTOH, if you can afford a bit of downtime to handle failures,
> Linux software RAID works pretty well, and since quad-core CPUs are
> now pretty much the standard, it's OK if parity calculation uses up a
> bit of one core on lower-performing servers like the reporting server
> the OP was talking about.

The database server is a quad-core machine, so it sounds as though
software RAID should work fine for the present setup; if the database
becomes more active, I should put some money into a hardware RAID
controller. I had assumed RAID-5 would be fine, but please let me know
if there is another RAID level more appropriate for this
implementation. Thanks for the valuable insight!
nabble.30.miller_2555@spamgourmet.com wrote:
> The database server is a quad-core machine, so it sounds as though
> software RAID should work fine for the present setup; if the database
> becomes more active, I should put some money into a hardware RAID
> controller. I had assumed RAID-5 would be fine, but please let me know
> if there is another RAID level more appropriate for this
> implementation. Thanks for the valuable insight!

RAID-5 performs very poorly on random small-block writes, which is
the majority of what databases do. RAID 10 is the preferred RAID
level for databases.

BTW, re the earlier discussion of RAID controllers vs. software: I'm
surprised no one mentioned that a 'real' RAID controller with a
battery-backed write-back cache can hugely speed up committed 8KB
block random writes, which are quite often the big bottleneck in a
transactional database.
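For what it's worth, 8.3 can buy a similar effect in software with
asynchronous commit. A sketch follows; to be clear, this is a
different trade-off than a battery-backed cache, since a crash can
lose the most recently committed transactions (though the database
itself stays consistent):

    -- synchronous_commit is new in 8.3: COMMIT returns without
    -- waiting for the WAL flush, so commits stop blocking on the
    -- disk. Unlike fsync = off, this cannot corrupt the database; a
    -- crash can only lose the last few transactions, which may be
    -- acceptable for a re-runnable nightly batch load.
    SET synchronous_commit = off;   -- per session; leave fsync on
    BEGIN;
    -- ... batch INSERTs / COPY here ...
    COMMIT;
    RESET synchronous_commit;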
In response to nabble.30.miller_2555@spamgourmet.com:
>
> > On Thu, Jul 9, 2009 at 7:29 PM, Greg Stark <gsstark@mit.edu> wrote:
> >> On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> >>>
> >>> $750 is about what a decent RAID controller would cost you, but given
> >>> your bulk-import scenario, you can probably do without one. In this
> >>> instance, you're probably better off with software RAID than with a
> >>> cheap RAID card, which will cost extra and probably be slower than
> >>> Linux software RAID.
> ...
> >> The main advantage of hardware RAID is the error handling. When you
> >> get low-level errors or pull a drive, a lot of consumer-level
> >> controllers and their drivers don't respond very well: they have long
> >> timeouts or keep retrying, tragically unaware that the software RAID
> >> would be able to handle the recovery. A good server-class RAID
> >> controller should handle those situations without breaking a sweat.
>
> > Definitely a big plus of a quality HW controller, and one of the
> > reasons I don't scrimp on the HW controllers I put in our 24/7
> > servers. OTOH, if you can afford a bit of downtime to handle failures,
> > Linux software RAID works pretty well, and since quad-core CPUs are
> > now pretty much the standard, it's OK if parity calculation uses up a
> > bit of one core on lower-performing servers like the reporting server
> > the OP was talking about.
>
> The database server is a quad-core machine, so it sounds as though
> software RAID should work fine for the present setup; if the database
> becomes more active, I should put some money into a hardware RAID
> controller. I had assumed RAID-5 would be fine, but please let me know
> if there is another RAID level more appropriate for this
> implementation. Thanks for the valuable insight!

RAID 10 is pretty much the fastest RAID level for disk IO. Every pair
of disks you add to a RAID-10 array makes the array faster (assuming
you don't hit any controller bottlenecks).

Another advantage of hardware RAID controllers (that I'm surprised
nobody has mentioned) is battery-backed cache. Using said cache, you
can configure the controller to lie about fsyncs, which makes them
essentially free from PostgreSQL's standpoint. Since the cache is
backed by a battery, your concerns about data loss in the event of a
power failure are much reduced. The cache doesn't usually increase
the overall throughput of the system, but it usually improves
peak-load performance by deferring writes until things are calmer.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/
On Fri, Jul 10, 2009 at 8:43 AM, John R Pierce <pierce@hogranch.com> wrote:
> nabble.30.miller_2555@spamgourmet.com wrote:
>>
>> The database server is a quad-core machine, so it sounds as though
>> software RAID should work fine for the present setup; if the database
>> becomes more active, I should put some money into a hardware RAID
>> controller. I had assumed RAID-5 would be fine, but please let me know
>> if there is another RAID level more appropriate for this
>> implementation. Thanks for the valuable insight!
>
> RAID-5 performs very poorly on random small-block writes, which is
> the majority of what databases do. RAID 10 is the preferred RAID
> level for databases.
>
> BTW, re the earlier discussion of RAID controllers vs. software: I'm
> surprised no one mentioned that a 'real' RAID controller with a
> battery-backed write-back cache can hugely speed up committed 8KB
> block random writes, which are quite often the big bottleneck in a
> transactional database.

Given that the OP's usage pattern was bulk imports and reporting
queries, it didn't seem very important.