Thread: Hardware HD choice...

Hardware HD choice...

From
"Lionel"
Date:
Hello,

I have to choose a dedicated server to host a big 8.3 database.
The global size of the database (indexes included) will grow by 40 Go every
year (40 millions of lines/year)
Real data (indexes excluded) will be around 5-7 Go/year.
I need to store 4 years of activity.
Very few simultaneous users (~4).
100000 rows added every day via csv imports.
The application will be a reporting application.
Main statements: aggregation of 10000 to 10millions of line.
Vast majority will hit 300000 lines (90% of the connected users), few will
hit more than 10 millions (10% of the connected users, there may never be 2
simultaneous users of this category).
20sec - 30sec for such a statement is acceptable.

It's quite easy to choose CPU (xeon quad core 2.66, maybe dual xeon), RAM
(8-12Go) but I still hesitate for hard disks.

these are options possible with the hosters I usually work with:

Option 0)
RAID1 750Go SATA2

Option 1)
RAID1 750Go SATA2 + 500Go USB disk

Option 2)
RAID1 SAS 15000rpm 147 Go hard disk + 500Go USB

Option 2+)
RAID1 SATA2 SSD intel X25-M 80Go + 500Go USB

Option 3)
RAID5 SATA2 5x750Go

Option 4)
RAID10 SAS 15000rpm 4x146 Go

Option 5)
RAID10 SATA2 4x250 Go

Any other better option that I could ask for ?

What would be the best choice in case of an external USB drive : using it
for indexes or x_log ?

And what is the best option to backup such a database ?
Nightly dump ?
data folder zip ?

thanks for your help.



Re: Hardware HD choice...

From
"Scott Marlowe"
Date:
On Thu, Oct 23, 2008 at 9:10 AM, Lionel <lionel@art-informatique.com> wrote:
> Hello,
>
> I have to choose a dedicated server to host a big 8.3 database.
> The global size of the database (indexes included) will grow by 40 Go every
> year (40 millions of lines/year)
> Real data (indexes excluded) will be around 5-7 Go/year.
> I need to store 4 years of activity.
> Very few simultaneous users (~4).
> 100000 rows added every day via csv imports.
> The application will be a reporting application.
> Main statements: aggregation of 10000 to 10millions of line.
> Vast majority will hit 300000 lines (90% of the connected users), few will
> hit more than 10 millions (10% of the connected users, there may never be 2
> simultaneous users of this category).
> 20sec - 30sec for such a statement is acceptable.

In that case, throw memory at the problem first, then lots of hard
drives on a good RAID controller.

> It's quite easy to choose CPU (xeon quad core 2.66, maybe dual xeon), RAM
> (8-12Go) but I still hesitate for hard disks.

I've had better luck with opterons than Xeons, but they're both pretty
good nowadays.  I'd look at at least 16 Gigs ram, if you can afford it
get 32.

> these are options possible with the hosters I usually work with:
>
> Option 0)
> RAID1 750Go SATA2
>
> Option 1)
> RAID1 750Go SATA2 + 500Go USB disk
>
> Option 2)
> RAID1 SAS 15000rpm 147 Go hard disk + 500Go USB
>
> Option 2+)
> RAID1 SATA2 SSD intel X25-M 80Go + 500Go USB
>
> Option 3)
> RAID5 SATA2 5x750Go
>
> Option 4)
> RAID10 SAS 15000rpm 4x146 Go
>
> Option 5)
> RAID10 SATA2 4x250 Go
>
> Any other better option that I could ask for ?

Yes, more drives.  4 drives in a RAID10 is a good start.  If you could
get 8 or 12 in one that's even better.

>
> What would be the best choice in case of an external USB drive : using it
> for indexes or x_log ?

Not to use one.  Generally USB transfer speeds and external USB drives
aren't reliable or fast enough for serious server use.  I say this
with two very nice external USB drives sitting next to me.  They store
my videos, not customer data.

> And what is the best option to backup such a database ?

PITR

-- When fascism comes to America, it will be draped in a flag and
carrying a cross - Sinclair Lewis

Re: Hardware HD choice...

From
"Scott Marlowe"
Date:
On Thu, Oct 23, 2008 at 10:38 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>
>> Any other better option that I could ask for ?
>
> Yes, more drives.  4 drives in a RAID10 is a good start.  If you could
> get 8 or 12 in one that's even better.
>

Note that for transactional databases SAS drives are usually
noticeably better, but for reporting databases, SATA drives are
generally fine, with 70-80% the sustained transfer rate at less than
half the cost per megabyte.  I'd recommend 8 SATA drives over 4 SAS
drives for a reporting database.  You'll spend about the same on twice
the number of drives but you'll get much more storage, which is often
useful when you need to work with large datasets.

Re: Hardware HD choice...

From
Ivan Voras
Date:
Lionel wrote:
> Hello,
>
> I have to choose a dedicated server to host a big 8.3 database.
> The global size of the database (indexes included) will grow by 40 Go every
> year (40 millions of lines/year)
> Real data (indexes excluded) will be around 5-7 Go/year.
> I need to store 4 years of activity.
> Very few simultaneous users (~4).
> 100000 rows added every day via csv imports.
> The application will be a reporting application.
> Main statements: aggregation of 10000 to 10millions of line.
> Vast majority will hit 300000 lines (90% of the connected users), few will
> hit more than 10 millions (10% of the connected users, there may never be 2
> simultaneous users of this category).
> 20sec - 30sec for such a statement is acceptable.
>
> It's quite easy to choose CPU (xeon quad core 2.66, maybe dual xeon), RAM
> (8-12Go) but I still hesitate for hard disks.

If the number of users is low or your queries are complex, a faster CPU
with fewer cores will serve you better because PostgreSQL cannot split a
single query across multiple CPUs/cores. It will also speed up your CSV
imports (be sure to do them as a single transaction or with COPY). If
you can bear the heat (and the electricity bill) you can get 3.4 GHz
Xeons. Be sure to use a 64-bit OS and lots of memory (16 GB+).

> Option 5)
> RAID10 SATA2 4x250 Go

Good enough.

> Any other better option that I could ask for ?

Yes, 8x250 :) You need as much drives as possible - not for capacity or
reliability but for speed. Use RAID 5 or RAID 6 only if the database
isn't going to be updated often (for example, if you add records to the
database only several times a day, it's ok).

> What would be the best choice in case of an external USB drive : using it
> for indexes or x_log ?

Skip any USB-connected drives for production environments. It's not
worth it. If new data isn't going to be added to the database
continuously, you don't need a separate x_log. Better use the drive as a
(hot, if possible) spare for RAID in case one of the RAID drives
malfunctions.




Attachment

Re: Hardware HD choice...

From
"Scott Carey"
Date:
If you are doing batch inserts of data, and want to have reporting queries concurrently running, make sure you have the pg_xlogs on a different disk than the data/indexes.   2 drives RAID 1 for OS + xlogs works great (and these can be SAS if you choose, have a separate partition -- ext2 if it is linux -- for the xlogs.  Then you can easily go with storage capacity and SATA for the main reporting portion.  You just don't want the inserts in batches to slow the whole thing to a crawl due to xlog writes on the same drive array as the reporting.
However, if you can only get a few disks, it is a lot harder to choose between one large array and two of them split without experimenting with both on real data and queries.  It is a quick and easy performance win if you have 6+ disks and do enough writes.

Also, if you intend to have lots of data organized by a time field, and expect to do the reporting/aggregation queries on subsets of that data bounded by time, partitioning by time can have huge benefits.  Partition by month, for example, and sequential scans will only flow to the months of interest if the queries have the right lmits on the date in the where clause.

Partitioning WILL take more development and tuning time, so don't do it unless you know you need it... though if the reporting is mostly restricted to time windows, the impact it has on improving runtimes of aggregation queries is immense.  However, partitioning won't help at all until you have enough data to justify it.

On Thu, Oct 23, 2008 at 10:16 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Thu, Oct 23, 2008 at 10:38 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>
>> Any other better option that I could ask for ?
>
> Yes, more drives.  4 drives in a RAID10 is a good start.  If you could
> get 8 or 12 in one that's even better.
>

Note that for transactional databases SAS drives are usually
noticeably better, but for reporting databases, SATA drives are
generally fine, with 70-80% the sustained transfer rate at less than
half the cost per megabyte.  I'd recommend 8 SATA drives over 4 SAS
drives for a reporting database.  You'll spend about the same on twice
the number of drives but you'll get much more storage, which is often
useful when you need to work with large datasets.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Hardware HD choice...

From
"Scott Marlowe"
Date:
On Thu, Oct 23, 2008 at 8:48 PM, Scott Carey <scott@richrelevance.com> wrote:
> If you are doing batch inserts of data, and want to have reporting queries
> concurrently running, make sure you have the pg_xlogs on a different disk
> than the data/indexes.   2 drives RAID 1 for OS + xlogs works great (and

From the OPs original post I'd guess that one big RAID 10 would serve
him best, but yeah, you need to test to really see.

> Also, if you intend to have lots of data organized by a time field, and
> expect to do the reporting/aggregation queries on subsets of that data
> bounded by time, partitioning by time can have huge benefits.  Partition by
> month, for example, and sequential scans will only flow to the months of
> interest if the queries have the right lmits on the date in the where
> clause.

I second this.  Partitioning in time in past reporting databases
resulted in huge performance improvements for select queries.

Re: Hardware HD choice...

From
Lutz Steinborn
Date:
On Thu, 23 Oct 2008 23:41:49 -0600
"Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> On Thu, Oct 23, 2008 at 8:48 PM, Scott Carey <scott@richrelevance.com> wrote:
> > If you are doing batch inserts of data, and want to have reporting queries
> > concurrently running, make sure you have the pg_xlogs on a different disk
> > than the data/indexes.   2 drives RAID 1 for OS + xlogs works great (and
>
> From the OPs original post I'd guess that one big RAID 10 would serve
> him best, but yeah, you need to test to really see.

Has anybody a performance comparison for postgresql between the various RAID
levels ?

many thanks in advance

regards


Re: Hardware HD choice...

From
"Lionel"
Date:
"Scott Marlowe" wrote:
> I second this.  Partitioning in time in past reporting databases
> resulted in huge performance improvements for select queries.

Most statements will load data from a single year, but multiple monthes.
I have a integer field containing the year and will use it for
partitionning.
It will also help a lot to remove one year after 4 years of activity.


Actually this same database is used with 2 millions of lines per year
(instead of 30). It is loaded with 3 years and runs quite fast
unpartitionned on a 4 years old single SCSI HD with 2Go of RAM, single core
pentium4.
It runs a LOT faster on a quad xeon 2.83GHz with 8Go of ram and SATA HD,
which is quite common now for dedicated servers.

I tried partitionning on it: it showed no performance gain for such a small
size, but it is an evidence that it will help with 30 millions of
lines/year.

OK, thanks to all your recommandations, I will ask hosters for a RAID10
4x250go SATA.