Thread: Database storage

Database storage

From: nabble.30.miller_2555@spamgourmet.com

Hi -

I'm running Postgres v8.3 on Fedora 10 (home machine, x86_64
architecture). After realizing that the storage requirements of one of
my databases will exceed 4TB, I wanted to see if anyone had any
suggestions for a hardware setup that works well with Postgres running
on Linux. I have partitioned most of the database, so older data can
go on slower devices, with only 400-500GB needed on faster devices
(see the tablespace sketch after the profile below). Redundancy is a
requirement. Overall database usage is low, with bursty input. For the
curious, here's the general profile:
    a) ~19,000,000 INSERT statements/day (190 INSERTs per commit block),
for a total of ~10 GB of INSERTs/day
    b) INSERTs occur sequentially (daily batch)
    c) A handful of SELECT statements run per day to generate a few
dozen reports, but these are not resource-intensive.
    d) Only 3-5 database users configured.
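
A minimal sketch of the device split I have in mind, using
tablespaces (the mount points and table names here are hypothetical):

    -- tablespaces on the fast and slow devices
    CREATE TABLESPACE fast_store LOCATION '/mnt/fast/pgdata';
    CREATE TABLESPACE slow_store LOCATION '/mnt/slow/pgdata';

    -- keep the active partition on fast storage, migrate old ones off
    -- (SET TABLESPACE rewrites the table, so do it during quiet hours)
    ALTER TABLE readings_2009_07 SET TABLESPACE fast_store;
    ALTER TABLE readings_2008 SET TABLESPACE slow_store;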

Initial thought: use an external multi-bay eSATA enclosure populated
with 7200 RPM+, 32MB-cache, 1TB+ SATA drives running RAID 5.

A few specific questions:
    1) Which components generally lead to bottlenecks (controller
card, individual HD cache, HD rotational speed, something else)?
    2) Are there better solutions (a broad question in terms of price/
data availability)? Budget: $500-$750 for the storage medium. Better
question: which setup has worked well with a Postgres implementation
on a similar budget?
    3) If using a similar setup, are there any particular
implementation issues that are not relatively obvious?

Thanks for the input (even if it's not Postgres-specific, I thought this
might be of interest to hobby database admins like myself running
Postgres implementations).


Re: Database storage

From: Scott Marlowe

On Thu, Jul 9, 2009 at 5:40 PM, <nabble.30.miller_2555@spamgourmet.com> wrote:
> Hi -
>
> I'm running Postgres v8.3 on Fedora 10 (home machine, x86_64
> architecture). After realizing that the storage requirements of one of
> my databases will exceed 4TB, I wanted to see if anyone had any
> suggestions for a hardware setup that works well with Postgres running
> on Linux. I have partitioned most of the database, so older data can
> go on slower devices, with only 400-500GB needed on faster devices.
> Redundancy is a requirement. Overall database usage is low, with
> bursty input. For the curious, here's the general profile:
>    a) ~19,000,000 INSERT statements/day (190 INSERTs per commit block),
> for a total of ~10 GB of INSERTs/day
>    b) INSERTs occur sequentially (daily batch)
>    c) A handful of SELECT statements run per day to generate a few
> dozen reports, but these are not resource-intensive.
>    d) Only 3-5 database users configured.
>
> Initial thought: use an external multi-bay eSATA enclosure populated
> with 7200 RPM+, 32MB-cache, 1TB+ SATA drives running RAID 5.

While RAID-5 is generally not the best database setup, considering
that your input will be bulk loads, you're probably OK.  Also, given
that 2TB server-class drives (5400 RPM) are now out, and that they have
about the same throughput as 1TB 7200 RPM server-class drives, you
might want to look into them, as you can get by on fewer drives for
the same capacity.

> A few specific questions:
>    1) Which components generally lead to bottlenecks (controller
> card, individual HD cache, HD rotational speed, something else)?

IO is almost always the bottleneck for queries over large data sets.
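
If you want to see that for yourself, something like this (iostat is
in the sysstat package) will show which device is saturated:

    # extended per-device stats, refreshed every 5 seconds
    iostat -x 5
    # a device pinned near 100 %util with climbing await times
    # is the IO bottleneck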

>    2) Are there better solutions (a broad question in terms of price/
> data availability)? Budget: $500-$750 for the storage medium. Better
> question: which setup has worked well with a Postgres implementation
> on a similar budget?

$750 is about what a decent RAID controller would cost you, but again,
given your bulk-import scenario, you're probably OK without one.  In
this instance you're probably better off with software RAID than with
a cheap RAID card, which will cost extra and will probably be slower
than Linux software RAID.
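
As a rough sketch, assuming four 1TB drives that show up as /dev/sdb
through /dev/sde in the enclosure (adjust the device names to your
system), the software RAID setup is just:

    # create a 4-drive RAID-5 array (3TB usable) for PGDATA
    # or a tablespace
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
    mkfs.ext3 /dev/md0
    mount /dev/md0 /mnt/pgstore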

Re: Database storage

From: Greg Stark

On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
>
> $750 is about what a decent RAID controller would cost you, but again,
> given your bulk-import scenario, you're probably OK without one.  In
> this instance you're probably better off with software RAID than with
> a cheap RAID card, which will cost extra and will probably be slower
> than Linux software RAID.


FWIW, the main disadvantage of software RAID is NOT speed -- Linux
software RAID is very fast. Aside from RAID-5, where it lets you
offload the parity calculation, there really isn't much speed benefit
to hardware RAID.

The main advantage of hardware RAID is the error handling. When you
get low-level errors or pull a drive, a lot of consumer-level
controllers and their drivers don't respond very well: they have long
timeouts or keep retrying, tragically unaware that the software RAID
would be able to handle the recovery. A good server-class RAID
controller should handle those situations without breaking a sweat.
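
If you do go the software route, it's worth checking whether your
drives let you cap their internal error recovery (TLER/ERC). A sketch
with smartmontools, assuming a drive that supports SCT ERC:

    # check whether the drive exposes SCT Error Recovery Control
    smartctl -l scterc /dev/sdb
    # cap read/write recovery at 7.0 seconds (units are tenths of a
    # second) so md sees an error instead of a drive hung retrying
    smartctl -l scterc,70,70 /dev/sdb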

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: Database storage

From: Scott Marlowe

On Thu, Jul 9, 2009 at 7:29 PM, Greg Stark<gsstark@mit.edu> wrote:
> On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
>>
>> $750 is about what a decent RAID controller would cost you, but again,
>> given your bulk-import scenario, you're probably OK without one.  In
>> this instance you're probably better off with software RAID than with
>> a cheap RAID card, which will cost extra and will probably be slower
>> than Linux software RAID.
>
>
> FWIW, the main disadvantage of software RAID is NOT speed -- Linux
> software RAID is very fast. Aside from RAID-5, where it lets you
> offload the parity calculation, there really isn't much speed benefit
> to hardware RAID.
>
> The main advantage of hardware RAID is the error handling. When you
> get low-level errors or pull a drive, a lot of consumer-level
> controllers and their drivers don't respond very well: they have long
> timeouts or keep retrying, tragically unaware that the software RAID
> would be able to handle the recovery. A good server-class RAID
> controller should handle those situations without breaking a sweat.

Definitely a big plus of a quality HW controller, and one of the
reasons I don't scrimp on the HW controllers I put in our 24/7
servers.  OTOH, if you can afford a bit of downtime to handle
failures, Linux software RAID works pretty well, and since quad-core
CPUs are now pretty much the standard, it's OK if the parity
calculation uses up a bit of one core on lower-performing servers like
the reporting server the OP was talking about.
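
If you're curious how much of a core the parity math actually eats,
the md kernel thread shows up in top during a bulk load (array name
assumed to be md0 here):

    # RAID-5 parity work runs in a kernel thread named after the array
    top -b -n 1 | grep md0_raid5
    # /proc/mdstat also shows array health and any resync in progress
    cat /proc/mdstat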

Re: Database storage

From: nabble.30.miller_2555@spamgourmet.com

> On Thu, Jul 9, 2009 at 7:29 PM, Greg Stark<gsstark@mit.edu> wrote:
>> On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
>>>
>>> $750 is about what a decent RAID controller would cost you, but again,
>>> given your bulk-import scenario, you're probably OK without one.  In
>>> this instance you're probably better off with software RAID than with
>>> a cheap RAID card, which will cost extra and will probably be slower
>>> than Linux software RAID.
...
>> The main advantage of hardware RAID is the error handling. When you
>> get low-level errors or pull a drive, a lot of consumer-level
>> controllers and their drivers don't respond very well: they have long
>> timeouts or keep retrying, tragically unaware that the software RAID
>> would be able to handle the recovery. A good server-class RAID
>> controller should handle those situations without breaking a sweat.

> Definitely a big plus of a quality HW controller, and one of the
> reasons I don't scrimp on the HW controllers I put in our 24/7
> servers.  OTOH, if you can afford a bit of downtime to handle
> failures, Linux software RAID works pretty well, and since quad-core
> CPUs are now pretty much the standard, it's OK if the parity
> calculation uses up a bit of one core on lower-performing servers like
> the reporting server the OP was talking about.

The database server is a quad-core machine, so software RAID should
work fine for the present setup. However, it sounds as though I should
put some money into a hardware RAID controller if the database becomes
more active. I had assumed RAID-5 would be fine, but please let me
know if there is another RAID level more appropriate for this
implementation. Thanks for the valuable insight!


Re: Database storage

From: John R Pierce

nabble.30.miller_2555@spamgourmet.com wrote:
> The database server is a quad-core machine, so software RAID should
> work fine for the present setup. However, it sounds as though I should
> put some money into a hardware RAID controller if the database becomes
> more active. I had assumed RAID-5 would be fine, but please let me
> know if there is another RAID level more appropriate for this
> implementation. Thanks for the valuable insight!
>

RAID-5 performs very poorly on random small-block writes, which are
the majority of what databases do.  RAID 10 is the preferred RAID
level for databases.



BTW, re the earlier discussion of RAID controllers vs. software: I'm
surprised no one mentioned that a 'real' RAID controller with
battery-backed write-back cache can hugely speed up committed 8 KB
block random writes, which are quite often the big bottleneck in a
transactional database.
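
With the same four hypothetical drives from earlier in the thread,
the RAID 10 version is a one-liner; you trade capacity (2TB usable
instead of RAID-5's 3TB) for much better random-write behavior:

    # 4-drive RAID 10; the optional far layout (f2) improves
    # sequential reads at some cost to write speed
    mdadm --create /dev/md0 --level=10 --layout=f2 \
          --raid-devices=4 /dev/sd[b-e]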



Re: Database storage

From: Bill Moran

In response to nabble.30.miller_2555@spamgourmet.com:

> > On Thu, Jul 9, 2009 at 7:29 PM, Greg Stark<gsstark@mit.edu> wrote:
> >> On Fri, Jul 10, 2009 at 1:28 AM, Scott Marlowe<scott.marlowe@gmail.com> wrote:
> >>>
> >>> $750 is about what a decent RAID controller would cost you, but again,
> >>> given your bulk-import scenario, you're probably OK without one.  In
> >>> this instance you're probably better off with software RAID than with
> >>> a cheap RAID card, which will cost extra and will probably be slower
> >>> than Linux software RAID.
> ...
> >> The main advantage of hardware RAID is the error handling. When you
> >> get low-level errors or pull a drive, a lot of consumer-level
> >> controllers and their drivers don't respond very well: they have long
> >> timeouts or keep retrying, tragically unaware that the software RAID
> >> would be able to handle the recovery. A good server-class RAID
> >> controller should handle those situations without breaking a sweat.
>
> > Definitely a big plus of a quality HW controller, and one of the
> > reasons I don't scrimp on the HW controllers I put in our 24/7
> > servers.  OTOH, if you can afford a bit of downtime to handle
> > failures, Linux software RAID works pretty well, and since quad-core
> > CPUs are now pretty much the standard, it's OK if the parity
> > calculation uses up a bit of one core on lower-performing servers
> > like the reporting server the OP was talking about.
>
> The database server is a quad-core machine, so software RAID should
> work fine for the present setup. However, it sounds as though I should
> put some money into a hardware RAID controller if the database becomes
> more active. I had assumed RAID-5 would be fine, but please let me
> know if there is another RAID level more appropriate for this
> implementation. Thanks for the valuable insight!

RAID 10 is pretty much the fastest RAID level for disk IO.  Every pair
of disks you add to a RAID-10 array makes the array faster (assuming
you don't hit any controller bottlenecks).

Another advantage of hardware RAID controllers (which I'm surprised
nobody has mentioned) is battery-backed cache.  Using said cache, you
can configure the controller to lie about fsyncs, which makes them
essentially free from PostgreSQL's standpoint.  Since the cache is
backed by a battery, your concerns about data loss in the event of a
power failure are much reduced.  The cache doesn't usually increase
the overall throughput of the system, but it usually improves
peak-load performance by deferring writes until things are calmer.
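
On the software side, 8.3's asynchronous commit buys a similar effect
for bulk loads.  A sketch of the relevant postgresql.conf settings
(the values here are starting-point guesses, not recommendations):

    # commits return before WAL reaches disk; a crash can lose the
    # last fraction of a second of transactions, but (unlike
    # fsync = off) it cannot corrupt the database
    synchronous_commit = off
    wal_buffers = 8MB            # assumed value; size to the batch
    checkpoint_segments = 32     # fewer checkpoints during the load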

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Database storage

From: Scott Marlowe

On Fri, Jul 10, 2009 at 8:43 AM, John R Pierce<pierce@hogranch.com> wrote:
> nabble.30.miller_2555@spamgourmet.com wrote:
>>
>> The database server is a quad-core machine, so software RAID should
>> work fine for the present setup. However, it sounds as though I should
>> put some money into a hardware RAID controller if the database becomes
>> more active. I had assumed RAID-5 would be fine, but please let me
>> know if there is another RAID level more appropriate for this
>> implementation. Thanks for the valuable insight!
>>
>
> RAID-5 performs very poorly on random small-block writes, which are
> the majority of what databases do.  RAID 10 is the preferred RAID
> level for databases.
>
>
>
> BTW, re the earlier discussion of RAID controllers vs. software: I'm
> surprised no one mentioned that a 'real' RAID controller with
> battery-backed write-back cache can hugely speed up committed 8 KB
> block random writes, which are quite often the big bottleneck in a
> transactional database.

Given that the OP's usage pattern is bulk imports and reporting
queries, it didn't seem very important.
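
For that bulk-import pattern, by the way, a single COPY per batch is
usually several times faster than even well-batched INSERTs (the table
name and file path below are made up):

    -- load the day's batch in one shot (server-side file path)
    COPY readings FROM '/data/batches/2009-07-10.csv' WITH CSV;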