Thread: New server setup

New server setup

From
Niels Kristian Schjødt
Date:
Hi, I'm going to set up a new server for my postgresql database, and I am considering one of these: http://www.hetzner.de/hosting/produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10 array. Has any of you any particular comments/pitfalls/etc. to mention on the setup? My application is very write heavy.



Re: New server setup

From
Craig James
Date:
On Fri, Mar 1, 2013 at 3:43 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
Hi, I'm going to setup a new server for my postgresql database, and I am considering one of these: http://www.hetzner.de/hosting/produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10 array. Has any of you any particular comments/pitfalls/etc. to mention on the setup? My application is very write heavy.

I can only tell you our experience with Dell from several years ago.  We bought two Dell servers similar to (somewhat larger than) the model you're looking at.  We'll never buy from them again.

Advantages:  They work.  They haven't failed.

Disadvantages:

Performance sucks.  Dell costs far more than "white box" servers we buy from a "white box" supplier (ASA Computers).  ASA gives us roughly double the performance for the same price.  We can buy exactly what we want from ASA.

Dell did a disk-drive "lock in."  The RAID controller won't spin up a non-Dell disk.  They wanted roughly four times the price for their disks compared to buying the exact same disks on Amazon.  If a disk went out today, it would probably cost even more because that model is obsolete (luckily, we bought a couple spares).  I think they abandoned this policy because it caused so many complaints, but you should check before you buy. This was an incredibly stupid RAID controller design.

Dell tech support doesn't know what they're talking about when it comes to RAID controllers and serious server support.  You're better off with a white-box solution, where you can buy the exact parts recommended in this group and get technical advice from people who know what they're talking about.  Dell basically doesn't understand Postgres.

They boast excellent on-site service, but for the price of their computers and their service contract, you can buy two servers from a white-box vendor.  Our white-box servers have been just as reliable as the Dell servers -- no failures.

I'm sure someone in Europe can recommend a good vendor for you.

Craig James
 




Re: New server setup

From
Wales Wang
Date:
Please choose PCI-E flash for a write-heavy app.

Wales

On 2013-3-1, at 8:43 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

> Hi, I'm going to setup a new server for my postgresql database, and I am considering one of these: http://www.hetzner.de/hosting/produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10 array. Has any of you any particular comments/pitfalls/etc. to mention on the setup? My application is very write heavy.
>
>
>


Re: New server setup

From
Niels Kristian Schjødt
Date:
Thanks both of you for your input.

Earlier I have been discussing my extremely high IO wait with you here on the mailing list, and I have tried a lot of tweaks, both to the postgresql config, the WAL directory location and the kernel, but unfortunately my problem persists, and I think I'm eventually down to just bad hardware (currently two 7200rpm disks in a software raid 1). So changing to 4 15000rpm SAS disks in a raid 10 is probably going to change a lot - don't you think? However, we are running a lot of background processing, sometimes with 300 connections to the db. So my question is, should I also get something like pgpool2 setup at the same time? Is it, from your experience, likely to increase my throughput a lot more, if I had a connection pool of eg. 20 connections, instead of 300 concurrent ones directly?



Re: New server setup

From
Kevin Grittner
Date:
Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

> So my question is, should I also get something like pgpool2 setup
> at the same time? Is it, from your experience, likely to increase
> my throughput a lot more, if I had a connection pool of eg. 20
> connections, instead of 300 concurrent ones directly?

In my experience, it can make a big difference.  If you are just
using the pooler for this reason, and don't need any of the other
features of pgpool, I suggest pgbouncer.  It is a simpler, more
lightweight tool.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: New server setup

From
Scott Marlowe
Date:
On Tue, Mar 5, 2013 at 9:34 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
>
>> So my question is, should I also get something like pgpool2 setup
>> at the same time? Is it, from your experience, likely to increase
>> my throughput a lot more, if I had a connection pool of eg. 20
>> connections, instead of 300 concurrent ones directly?
>
> In my experience, it can make a big difference.  If you are just
> using the pooler for this reason, and don't need any of the other
> features of pgpool, I suggest pgbouncer.  It is a simpler, more
> lightweight tool.

I second the pgbouncer rec.


Re: New server setup

From
Niels Kristian Schjødt
Date:
Thanks, that was actually what I just ended up doing yesterday. Any suggestions on how to tune pgbouncer?

BTW, I have just bumped into an issue that caused me to disable pgbouncer again. My web application queries the database with a per-request search_path. This is because I use schemas to provide country-based separation of my data (e.g. English, German, Danish data in different schemas). I have pgbouncer set up with transactional behavior (pool_mode = transaction) - however, some of my colleagues complained that it sometimes didn't return data from the right schema set in the search_path - you wouldn't by chance have any idea what is going wrong, would you?

#################### pgbouncer.ini
[databases]
production =

[pgbouncer]

logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid
listen_addr = localhost
listen_port = 6432
unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = postgres
pool_mode = transaction
server_reset_query = DISCARD ALL
max_client_conn = 500
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 10
#####################
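
For reference, a minimal sketch of one way to make a per-request search_path coexist with pool_mode = transaction: issue the SET LOCAL inside the same transaction as the queries, so it travels on the same pooled server connection and is reset at COMMIT. The connection parameters, schema name and table name below are only illustrative.

# Talk to pgbouncer (port 6432). SET LOCAL only lasts until COMMIT, so it
# cannot leak into whatever transaction the pooled connection serves next.
psql "host=localhost port=6432 dbname=production" <<'SQL'
BEGIN;
SET LOCAL search_path TO german, public;
SELECT count(*) FROM cars;   -- hypothetical table, resolved as german.cars
COMMIT;
SQL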


Den 05/03/2013 kl. 17.34 skrev Kevin Grittner <kgrittn@ymail.com>:

> Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
>
>> So my question is, should I also get something like pgpool2 setup
>> at the same time? Is it, from your experience, likely to increase
>> my throughput a lot more, if I had a connection pool of eg. 20
>> connections, instead of 300 concurrent ones directly?
>
> In my experience, it can make a big difference.  If you are just
> using the pooler for this reason, and don't need any of the other
> features of pgpool, I suggest pgbouncer.  It is a simpler, more
> lightweight tool.
>
> --
> Kevin Grittner
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: New server setup

From
"Benjamin Krajmalnik"
Date:
Set it to use session.  I had a similar issue having moved one of the components of our app to use transactions, which introduced an undesired behavior.




Re: New server setup

From
Niels Kristian Schjødt
Date:
Okay, thanks - but hey - if I put it at session pooling, then it says in the documentation: "default_pool_size: In session pooling it needs to be the number of max clients you want to handle at any moment". So as I understand it, is it true that I then have to set default_pool_size to 300 if I have up to 300 client connections? And then how would the pooler help my performance - would that just be exactly like having the 300 clients connect directly to the database?

-NK



Re: New server setup

From
Jeff Janes
Date:
On Tue, Mar 5, 2013 at 10:27 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
Okay, thanks - but hey - if I put it at session pooling, then it says in the documentation: "default_pool_size: In session pooling it needs to be the number of max clients you want to handle at any moment". So as I understand it, is it true that I then have to set default_pool_size to 300 if I have up to 300 client connections?

If those 300 client connections are all long-lived, then yes you need that many in the pool.  If they are short-lived connections, then you can have a lot less as any ones over the default_pool_size will simply block until an existing connection is closed and can be re-assigned--which won't take long if they are short-lived connections.


And then how would the pooler help my performance - would that just be exactly like having the 300 clients connect directly to the database?

It would probably be even worse than having 300 clients connected directly.  There would be no point in using a pooler under those conditions.

 
Cheers,

Jeff

Re: New server setup

From
Gregg Jaskiewicz
Date:
In my recent experience PgPool2 performs pretty badly as a pooler. I'd avoid it if possible, unless you depend on other features. 
It simply doesn't scale. 






--
GJ

Re: New server setup

From
Greg Smith
Date:
On 3/1/13 6:43 AM, Niels Kristian Schjødt wrote:
> Hi, I'm going to setup a new server for my postgresql database, and I am considering one of these: http://www.hetzner.de/hosting/produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10 array. Has any of you any particular comments/pitfalls/etc. to mention on the setup? My application is very write heavy.

The Dell PERC H710 (actually a LSI controller) works fine for
write-heavy workloads on a RAID 10, as long as you order it with a
battery backup unit module.  Someone must install the controller
management utility and do three things however:

1) Make sure the battery-backup unit is working.

2) Configure the controller so that the *disk* write cache is off.

3) Set the controller cache to "write-back when battery is available".
That will use the cache when it is safe to do so, and if not it will
bypass it.  That will make the server slow down if the battery fails,
but it won't ever become unsafe at writing.
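
As a rough sketch, those three steps might look like this with LSI's MegaCli
utility on Linux (option spellings vary between MegaCli releases, so treat
these as illustrative and verify them against your controller's documentation):

MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL        # 1) confirm the BBU is present and healthy
MegaCli64 -LDSetProp -DisDskCache -LAll -aALL   # 2) turn the drives' own write caches off
MegaCli64 -LDSetProp WB -LAll -aALL             # 3) controller cache in write-back mode...
MegaCli64 -LDSetProp NoCachedBadBBU -LAll -aALL #    ...falling back to write-through if the battery fails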

See http://wiki.postgresql.org/wiki/Reliable_Writes for more information
about this topic.  If you'd like some consulting help with making sure
the server is working safely and as fast as it should be, 2ndQuadrant
does offer a hardware benchmarking service to do that sort of thing:
http://www.2ndquadrant.com/en/hardware-benchmarking/  I think we're even
generating those reports in German now.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: New server setup

From
Gregg Jaskiewicz
Date:
On 10 March 2013 15:58, Greg Smith <greg@2ndquadrant.com> wrote:
On 3/1/13 6:43 AM, Niels Kristian Schjødt wrote:
Hi, I'm going to setup a new server for my postgresql database, and I am considering one of these: http://www.hetzner.de/hosting/produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10 array. Has any of you any particular comments/pitfalls/etc. to mention on the setup? My application is very write heavy.

The Dell PERC H710 (actually a LSI controller) works fine for write-heavy workloads on a RAID 10, as long as you order it with a battery backup unit module.  Someone must install the controller management utility and do three things however:

We're going to go with either HP or IBM (customer's preference, etc). 

 
1) Make sure the battery-backup unit is working.

2) Configure the controller so that the *disk* write cache is off.

3) Set the controller cache to "write-back when battery is available". That will use the cache when it is safe to do so, and if not it will bypass it.  That will make the server slow down if the battery fails, but it won't ever become unsafe at writing.

See http://wiki.postgresql.org/wiki/Reliable_Writes for more information about this topic.  If you'd like some consulting help with making sure the server is working safely and as fast as it should be, 2ndQuadrant does offer a hardware benchmarking service to do that sort of thing: http://www.2ndquadrant.com/en/hardware-benchmarking/  I think we're even generating those reports in German now.


Thanks Greg. I will follow the advice there, and also the advice in your book. I do always make sure they order battery-backed cache (or flash-based, which seems to be what people use these days).

I think the subject of using external help with setting things up did come up, but more around connection pooling than the hardware itself (in short, pgpool2 is crap; we will go with a DNS-based solution and apps connecting directly to nodes).
I will let my clients (doing this on a contract) know that there's an option to get you guys to help us. Mind you, this database is rather small in the grand scheme of things (30-40GB). Just possibly a lot of occasional writes.

We wouldn't need German. But proper English (i.e. British English) would always be nice ;)

Whilst on the hardware subject, someone mentioned throwing SSDs into the mix, i.e. combining spinning HDs with SSDs; apparently some RAID cards can use small-ish (80GB+) SSDs as external caches. Any experiences with that?


Thanks !

 


--
GJ

Re: New server setup

From
John Lister
Date:
On 12/03/2013 21:41, Gregg Jaskiewicz wrote:
>
> Whilst on the hardware subject, someone mentioned throwing ssd into
> the mix. I.e. combining spinning HDs with SSD, apparently some raid
> cards can use small-ish (80GB+) SSDs as external caches. Any
> experiences with that ?
>
The new LSI/Dell cards do this (e.g. the H710 mentioned in an earlier post). It is easy to set up and, it seems, supported on all versions of Dell's cards even if the docs say it isn't. It worked well in the limited testing I did; I have since switched to pretty much all SSD drives in my current setup.

These cards also supposedly support enhanced performance with just SSDs
(CTIO) by playing with the cache settings, but to be honest I haven't
noticed any difference and I'm not entirely sure it is enabled as there
is no indication that CTIO is actually enabled and working.

John


Re: New server setup

From
Greg Jaskiewicz
Date:
On 13 Mar 2013, at 15:33, John Lister <john.lister@kickstone.com> wrote:

> On 12/03/2013 21:41, Gregg Jaskiewicz wrote:
>>
>> Whilst on the hardware subject, someone mentioned throwing ssd into the mix. I.e. combining spinning HDs with SSD, apparently some raid cards can use small-ish (80GB+) SSDs as external caches. Any experiences with that ?
>>
> The new LSI/Dell cards do this (eg H710 as mentioned in an earlier post). It is easy to set up and supported it seems on all versions of dells cards even if the docs say it isn't. Works well with the limited testing I did, switched to pretty much all SSD drives in my current setup
>
> These cards also supposedly support enhanced performance with just SSDs (CTIO) by playing with the cache settings, but to be honest I haven't noticed any difference and I'm not entirely sure it is enabled as there is no indication that CTIO is actually enabled and working.
>
SSDs have a much shorter life than spinning drives, so what do you do when one inevitably fails in your system?

Re: New server setup

From
John Lister
Date:
On 13/03/2013 15:50, Greg Jaskiewicz wrote:
> SSDs have much shorter life then spinning drives, so what do you do when one inevitably fails in your system ?
Define much shorter? I accept they have a limited number of writes, but
that depends on load. You can actively monitor a drive's "health" level in
terms of wear using SMART, and it is relatively straightforward to
calculate an estimate of life based on average use; for me that works
out to something in excess of 5 years. Experience tells me that spinning
drives have a habit of failing in that time frame as well :( and in 5
years I'll be replacing the server probably.
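
As a sketch, that check looks something like this with smartmontools (the
wear attribute name differs by vendor - Media_Wearout_Indicator on Intel,
Wear_Leveling_Count on Samsung - and /dev/sda is a placeholder):

# Dump all SMART data and pull out the wear/endurance attributes.
smartctl -a /dev/sda | egrep -i 'Media_Wearout|Wear_Leveling|Total_LBAs_Written'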

I also overprovisioned the drives by about an extra 13%, giving me 20%
spare capacity when adding in the 7% manufacturer spare space. Given
this, currently my drives have written about 4TB of data each and show 0%
wear; this is for 160GB drives. I actively monitor the wear level and
plan to replace the drives when they get low. For a comparison of write
levels see
http://www.xtremesystems.org/forums/showthread.php?271063-SSD-Write-Endurance-25nm-Vs-34nm;
it shows that the 320 series was reported to have hit the wear limit
at 190TB (for a drive 1/4 the size of mine) but actually managed nearer
700TB before the drive failed.

I've mixed 2 different manufacturers in my raid 10 pairs to mitigate
against both drives of a pair failing at the same time, either due to a
firmware bug or to being full. In addition, when I was setting the box up
I did some performance testing against the drives, but using different
combinations for each test - the aim here is to pre-load each drive
differently to prevent them failing simultaneously when full.

If you do go for raid 10, make sure the drives have power-fail endurance,
i.e. a capacitor or battery on the drive.

John


Re: New server setup

From
Steve Crawford
Date:
On 03/13/2013 09:15 AM, John Lister wrote:
> On 13/03/2013 15:50, Greg Jaskiewicz wrote:
>> SSDs have much shorter life then spinning drives, so what do you do
>> when one inevitably fails in your system ?
> Define much shorter? I accept they have a limited no of writes, but
> that depends on load. You can actively monitor the drives "health"
> level...

What concerns me more than wear is this:

InfoWorld Article:
http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715

Referenced research paper:
https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault

Kind of messes with the "D" in ACID.

Cheers,
Steve



Re: New server setup

From
Karl Denninger
Date:


One potential way around this is to run ZFS as the underlying filesystem and use the SSDs as cache drives.  If they lose data due to a power problem it is non-destructive.

Short of that you cannot use an SSD on a machine where silent corruption is unacceptable UNLESS you know it has a supercap or similar IN THE DISK that guarantees the on-drive cache can be flushed in the event of a power failure.  A battery-backed controller cache DOES NOTHING to alleviate this risk.  If you violate this rule and the power goes off you must EXPECT silent and possibly catastrophic data corruption.

Only a few (and they're expensive!) SSD drives have said protection.  If yours does not the only SAFE option is as I described up above using them as ZFS cache devices.
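
A one-line sketch of that arrangement (pool and device names are placeholders):

# Add the SSD as an L2ARC read cache; a cache vdev can fail or be removed without losing pool data.
zpool add tank cache /dev/sdb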

--
-- Karl Denninger
The Market Ticker ®
Cuda Systems LLC

Re: New server setup

From
CSS
Date:
On Mar 13, 2013, at 3:23 PM, Steve Crawford wrote:

> On 03/13/2013 09:15 AM, John Lister wrote:
>> On 13/03/2013 15:50, Greg Jaskiewicz wrote:
>>> SSDs have much shorter life then spinning drives, so what do you do when one inevitably fails in your system ?
>> Define much shorter? I accept they have a limited no of writes, but that depends on load. You can actively monitor
thedrives "health" level... 
>
> What concerns me more than wear is this:
>
> InfoWorld Article:
> http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715
>
> Referenced research paper:
> https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
>
> Kind of messes with the "D" in ACID.

Have a look at this:

http://blog.2ndquadrant.com/intel_ssd_now_off_the_sherr_sh/

I'm not sure what other ssds offer this, but Intel's newest entry will, and it's attractively priced.

Another way we leverage SSDs that can be more reliable in the face of total SSD meltdown is to use them as ZFS Intent Log caches.  All the sync writes get handled on the SSDs.  We deploy them as mirrored vdevs, so if one fails, we're OK.  If both fail, we're really slow until someone can replace them.  On modest hardware, I was able to get about 20K TPS out of pgbench with the SSDs configured as ZIL and 4 10K raptors as the spinny disks.
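
A sketch of that layout (pool and device names are placeholders):

# Mirrored ZFS intent log on two SSDs in front of the spinning-disk pool "tank",
# so a single SSD failure doesn't take the ZIL with it.
zpool add tank log mirror /dev/sdb /dev/sdc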

In either case, the amount of money you'd have to spend on the two-dozen or so SAS drives (and the controllers, enclosure, etc.) that would equal a few pairs of SSDs in random IO performance is non-trivial, even if you plan on proactively retiring your SSDs every year.

Just another take on the issue..

Charles




Re: New server setup

From
John Lister
Date:
On 13/03/2013 19:23, Steve Crawford wrote:
> On 03/13/2013 09:15 AM, John Lister wrote:
>> On 13/03/2013 15:50, Greg Jaskiewicz wrote:
>>> SSDs have much shorter life then spinning drives, so what do you do
>>> when one inevitably fails in your system ?
>> Define much shorter? I accept they have a limited no of writes, but
>> that depends on load. You can actively monitor the drives "health"
>> level...
>
> What concerns me more than wear is this:
>
> InfoWorld Article:
> http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715
>
When I read this they didn't name the drives that failed - or those that
passed. But I'm assuming the failed ones are standard consumer SSDs, and
the 2 good ones were either enterprise drives or had caps. The reason I
say this is that SSD drives, by the nature of their operation, cache/store
information in RAM while they write it to the flash and to handle the
mappings, etc., of real to virtual sectors; if they lose power it is this
that is lost, causing at best corruption if not complete loss of the
drive. Enterprise drives (and some consumer ones, such as the 320s) have
either capacitors or battery backup to allow the drive to shut down
safely. There have been various reports both on this list and elsewhere
showing that these drives successfully survive repeated power failures.

A bigger concern is the state of the firmware in these drives, which
until recently was more likely to trash your drive - fortunately things
seem to be becoming more stable with age now.

John


Re: New server setup

From
David Boreham
Date:
On 3/13/2013 1:23 PM, Steve Crawford wrote:
>
> What concerns me more than wear is this:
>
> InfoWorld Article:
> http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715
>
>
> Referenced research paper:
> https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
>
>
> Kind of messes with the "D" in ACID.

It is somewhat surprising to discover that many SSD products are not
durable under sudden power loss (what were they thinking!?, and ...why
doesn't anyone care??).

However, there is a set of SSD types known to be designed to address
power loss events that have been tested by contributors to this list.
Use only those devices and you won't see this problem. SSDs do have a
wear-out mechanism but wear can be monitored and devices replaced in
advance of failure. In practice longevity is such that most machines
will be in the dumpster long before the SSD wears out. We've had
machines running with several hundred wps constantly for 18 months using
Intel 710 drives and the wear level SMART value is still zero.

In addition, like any electronics module (CPU, memory, NIC), an SSD can
fail so you do need to arrange for valuable data to be replicated.
As with old school disk drives, firmware bugs are a concern so you might
want to consider what would happen if all the drives of a particular
type all decided to quit working at the same second in time (I've only
seen this happen myself with magnetic drives, but in theory it could
happen with SSD).







Re: New server setup

From
Mark Kirkwood
Date:
On 14/03/13 09:16, David Boreham wrote:
> On 3/13/2013 1:23 PM, Steve Crawford wrote:
>>
>> What concerns me more than wear is this:
>>
>> InfoWorld Article:
>> http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715
>>
>>
>> Referenced research paper:
>> https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
>>
>>
>> Kind of messes with the "D" in ACID.
>
> It is somewhat surprising to discover that many SSD products are not
> durable under sudden power loss (what where they thinking!?, and ...why
> doesn't anyone care??).
>
> However, there is a set of SSD types known to be designed to address
> power loss events that have been tested by contributors to this list.
> Use only those devices and you won't see this problem. SSDs do have a
> wear-out mechanism but wear can be monitored and devices replaced in
> advance of failure. In practice longevity is such that most machines
> will be in the dumpster long before the SSD wears out. We've had
> machines running with several hundred wps constantly for 18 months using
> Intel 710 drives and the wear level SMART value is still zero.
>
> In addition, like any electronics module (CPU, memory, NIC), an SSD can
> fail so you do need to arrange for valuable data to be replicated.
> As with old school disk drives, firmware bugs are a concern so you might
> want to consider what would happen if all the drives of a particular
> type all decided to quit working at the same second in time (I've only
> seen this happen myself with magnetic drives, but in theory it could
> happen with SSD).
>
>

Just going through this now with a vendor. They initially assured us
that the drives had "end to end protection" so we did not need to worry.
I had to post stripdown pictures from Intel's s3700, showing obvious
capacitors attached to the board before I was taken seriously and
actually meaningful specifications were revealed. So now I'm demanding
to know:

- chipset (and version)
- original manufacturer (for re-badged ones)
- power off protection *explicitly* mentioned
- show me the circuit board (and where are the capacitors)

Seems like you gotta push 'em!

Cheers

Mark





Re: New server setup

From
David Boreham
Date:
On 3/13/2013 9:29 PM, Mark Kirkwood wrote:
Just going through this now with a vendor. They initially assured us that the drives had "end to end protection" so we did not need to worry. I had to post stripdown pictures from Intel's s3700, showing obvious capacitors attached to the board before I was taken seriously and actually meaningful specifications were revealed. So now I'm demanding to know:

- chipset (and version)
- original manufacturer (for re-badged ones)
- power off protection *explicitly* mentioned
- show me the circuit board (and where are the capacitors)

In addition to the above, I only use drives where I've seen compelling evidence that plug pull tests have been done and passed (e.g. done by someone on this list or in-house here).  I also like to have a high level of confidence in the firmware development group. This results in a very small set of acceptable products :(



Re: New server setup

From
Bruce Momjian
Date:
On Tue, Mar 12, 2013 at 09:41:08PM +0000, Gregg Jaskiewicz wrote:
> On 10 March 2013 15:58, Greg Smith <greg@2ndquadrant.com> wrote:
>
>     On 3/1/13 6:43 AM, Niels Kristian Schjødt wrote:
>
>         Hi, I'm going to setup a new server for my postgresql database, and I
>         am considering one of these: http://www.hetzner.de/hosting/
>         produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10
>         array. Has any of you any particular comments/pitfalls/etc. to mention
>         on the setup? My application is very write heavy.
>
>
>     The Dell PERC H710 (actually a LSI controller) works fine for write-heavy
>     workloads on a RAID 10, as long as you order it with a battery backup unit
>     module.  Someone must install the controller management utility and do
>     three things however:
>
>
> We're going to go with either HP or IBM (customer's preference, etc). 
>
>  
>
>     1) Make sure the battery-backup unit is working.
>
>     2) Configure the controller so that the *disk* write cache is off.

Only use SSDs with a BBU cache, and don't set SSD caches to
write-through because an SSD needs to cache the write to avoid wearing
out the chips early, see:

    http://momjian.us/main/blogs/pgblog/2012.html#August_3_2012
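
For a directly attached drive on Linux, a hedged sketch of how to check and
set that drive-level cache (the device name is a placeholder):

hdparm -W  /dev/sdb   # show whether the on-drive write cache is enabled
hdparm -W1 /dev/sdb   # enable it, i.e. keep the drive out of write-through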

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: New server setup

From
Mark Kirkwood
Date:
On 15/03/13 07:54, Bruce Momjian wrote:
> Only use SSDs with a BBU cache, and don't set SSD caches to
> write-through because an SSD needs to cache the write to avoid wearing
> out the chips early, see:
>
>     http://momjian.us/main/blogs/pgblog/2012.html#August_3_2012
>

I'm not convinced about the need for a BBU with SSDs - you *can* use them
without one, you just need to make sure about suitable longevity and also
the presence of (proven) power-off protection (as discussed previously).
It is worth noting that using unproven SSDs, or SSDs known to be lacking
power-off protection, with a BBU will *not* save you from massive
corruption (or device failure) upon unexpected power loss.

Also, in terms of performance, the faster PCIe SSD do about as well by
themselves as connected to a RAID card with BBU. In fact they will do
better in some cases (the faster SSD can get close to the max IOPS many
RAID cards can handle...so more than a couple of 'em plugged into one
card will be throttled by its limitations).

Cheers

Mark


Re: New server setup

From
Mark Kirkwood
Date:
On 15/03/13 10:37, Mark Kirkwood wrote:
>
> Also, in terms of performance, the faster PCIe SSD do about as well by
> themselves as connected to a RAID card with BBU.
>

Sorry - I meant to say "the faster **SAS** SSD do...", since you can't
currently plug PCIe SSD into RAID cards (confusingly, some of the PCIe
guys actually have RAID card firmware on their boards...Intel 910 I think).

Cheers

Mark


Re: New server setup

From
Bruce Momjian
Date:
On Fri, Mar 15, 2013 at 10:37:55AM +1300, Mark Kirkwood wrote:
> On 15/03/13 07:54, Bruce Momjian wrote:
> >Only use SSDs with a BBU cache, and don't set SSD caches to
> >write-through because an SSD needs to cache the write to avoid wearing
> >out the chips early, see:
> >
> >    http://momjian.us/main/blogs/pgblog/2012.html#August_3_2012
> >
>
> I not convinced about the need for BBU with SSD - you *can* use them
> without one, just need to make sure about suitable longevity and
> also the presence of (proven) power off protection (as discussed
> previously). It is worth noting that using unproven or SSD known to
> be lacking power off protection with a BBU will *not* save you from
> massive corruption (or device failure) upon unexpected power loss.

I don't think any drive that corrupts on power-off is suitable for a
database, but for non-db uses, sure, I guess they are OK, though you
have to be pretty money-constrained to like that tradeoff.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: New server setup

From
Mark Kirkwood
Date:
On 15/03/13 11:34, Bruce Momjian wrote:
>
> I don't think any drive that corrupts on power-off is suitable for a
> database, but for non-db uses, sure, I guess they are OK, though you
> have to be pretty money-constrainted to like that tradeoff.
>

Agreed - really *all* SSDs should have capacitor (or equivalent) power-off
protection...the fact that it's a feature present on only a handful
of drives is...disappointing.



Re: New server setup

From
David Boreham
Date:
On 3/14/2013 3:37 PM, Mark Kirkwood wrote:
I not convinced about the need for BBU with SSD - you *can* use them without one, just need to make sure about suitable longevity and also the presence of (proven) power off protection (as discussed previously). It is worth noting that using unproven or SSD known to be lacking power off protection with a BBU will *not* save you from massive corruption (or device failure) upon unexpected power loss.

I think it probably depends on the specifics of the deployment, but for us the fact that the BBU isn't required in order to achieve high write tps with SSDs is one of the key benefits -- the power, cooling and space savings over even a few servers are significant. In our case we only have one or two drives per server so no need for fancy drive string arrangements.

Also, in terms of performance, the faster PCIe SSD do about as well by themselves as connected to a RAID card with BBU. In fact they will do better in some cases (the faster SSD can get close to the max IOPS many RAID cards can handle...so more than a couple of 'em plugged into one card will be throttled by its limitations).

You might want to evaluate the performance you can achieve with a single-SSD (use several for capacity by all means) before considering a RAID card + SSD solution.
Again I bet it depends on the application but our experience with the older Intel 710 series is that their performance out-runs the CPU, at least under our PG workload.


Re: New server setup

From
Rick Otten
Date:
>> I not convinced about the need for BBU with SSD - you *can* use them
>> without one, just need to make sure about suitable longevity and also
>> the presence of (proven) power off protection (as discussed
>> previously). It is worth noting that using unproven or SSD known to be
>> lacking power off protection with a BBU will *not* save you from
>> massive corruption (or device failure) upon unexpected power loss.

>I don't think any drive that corrupts on power-off is suitable for a database, but for non-db uses, sure, I guess they are OK, though you have to be pretty money-constrained to like that tradeoff.

Wouldn't mission critical databases normally be configured in a high availability cluster - presumably with replicas running on different power sources?

If you lose power to a member of the cluster (or even the master), you would have new data coming in and stuff to do long before it could come back online - corrupted disk or not.

I find it hard to imagine configuring something that is too critical to be able to be restored from periodic backup to NOT be in a (synchronous) cluster.  I'm not sure what all the fuss over whether an SSD might come back after a hard server failure is really about.  You should architect the solution so you can lose the server and throw it away and never bring it back online again.  Native streaming replication is fairly straightforward to configure.  Asynchronous multimaster (albeit with some synchronization latency) is also fairly easy to configure using third party tools such as SymmetricDS.

Agreed that adding a supercap doesn't sound like a hard thing for a hardware manufacturer to do, but I don't think it should necessarily be a showstopper for being able to take advantage of some awesome I/O performance opportunities.
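
As a sketch of what a 9.2-era streaming replica involves (host name, user and
data directory are placeholders, and the primary is assumed to already have
wal_level = hot_standby, max_wal_senders > 0 and a replication entry in
pg_hba.conf):

# On the standby: clone the primary and stream WAL while copying.
pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/9.2/main -X stream -P
# Then create recovery.conf in that data directory with
#   standby_mode = 'on'
#   primary_conninfo = 'host=primary.example.com user=replicator'
# and start the standby.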







Re: New server setup

From
Bruce Momjian
Date:
On Fri, Mar 15, 2013 at 06:06:02PM +0000, Rick Otten wrote:
> >I don't think any drive that corrupts on power-off is suitable for a
> >database, but for non-db uses, sure, I guess they are OK, though you
> >have to be pretty money->constrainted to like that tradeoff.
>
> Wouldn't mission critical databases normally be configured in a high
> availability cluster - presumably with replicas running on different
> power sources?
>
> If you lose power to a member of the cluster (or even the master), you
> would have new data coming in and stuff to do long before it could
> come back online - corrupted disk or not.
>
> I find it hard to imagine configuring something that is too critical
> to be able to be restored from periodic backup to NOT be in a
> (synchronous) cluster.  I'm not sure all the fuss over whether an SSD
> might come back after a hard server failure is really about.  You
> should architect the solution so you can lose the server and throw
> it away and never bring it back online again.  Native streaming
> replication is fairly straightforward to configure.  Asynchronous
> multimaster (albeit with some synchronization latency) is also fairly
> easy to configure using third party tools such as SymmetricDS.
>
> Agreed that adding a supercap doesn't sound like a hard thing for
> a hardware manufacturer to do, but I don't think it should be a
> necessarily be showstopper for being able to take advantage of some
> awesome I/O performance opportunities.

Do you want to recreate the server if it loses power over an extra $100
per drive?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: New server setup

From
Scott Marlowe
Date:
On Fri, Mar 15, 2013 at 12:06 PM, Rick Otten <rotten@manta.com> wrote:
>>> I not convinced about the need for BBU with SSD - you *can* use them
>>> without one, just need to make sure about suitable longevity and also
>>> the presence of (proven) power off protection (as discussed
>>> previously). It is worth noting that using unproven or SSD known to be
>>> lacking power off protection with a BBU will *not* save you from
>>> massive corruption (or device failure) upon unexpected power loss.
>
>>I don't think any drive that corrupts on power-off is suitable for a database, but for non-db uses, sure, I guess they are OK, though you have to be pretty money-constrained to like that tradeoff.
>
> Wouldn't mission critical databases normally be configured in a high availability cluster - presumably with replicas running on different power sources?

I've worked in high end data centers where certain failures resulted
in ALL power being lost.  More than once. Relying on never losing
power to keep your data from getting corrupted is not a good idea. Now
if they're geographically separate you're maybe OK.


Re: New server setup

From
Mark Kirkwood
Date:
On 16/03/13 07:06, Rick Otten wrote:
>>> I not convinced about the need for BBU with SSD - you *can* use them
>>> without one, just need to make sure about suitable longevity and also
>>> the presence of (proven) power off protection (as discussed
>>> previously). It is worth noting that using unproven or SSD known to be
>>> lacking power off protection with a BBU will *not* save you from
>>> massive corruption (or device failure) upon unexpected power loss.
>
>> I don't think any drive that corrupts on power-off is suitable for a database, but for non-db uses, sure, I guess they are OK, though you have to be pretty money-constrained to like that tradeoff.
>
> Wouldn't mission critical databases normally be configured in a high availability cluster - presumably with replicas running on different power sources?
>
> If you lose power to a member of the cluster (or even the master), you would have new data coming in and stuff to do long before it could come back online - corrupted disk or not.
>
> I find it hard to imagine configuring something that is too critical to be able to be restored from periodic backup to NOT be in a (synchronous) cluster.  I'm not sure what all the fuss over whether an SSD might come back after a hard server failure is really about.  You should architect the solution so you can lose the server and throw it away and never bring it back online again.  Native streaming replication is fairly straightforward to configure.  Asynchronous multimaster (albeit with some synchronization latency) is also fairly easy to configure using third party tools such as SymmetricDS.
>
> Agreed that adding a supercap doesn't sound like a hard thing for a hardware manufacturer to do, but I don't think it should necessarily be a showstopper for being able to take advantage of some awesome I/O performance opportunities.
>
>

A somewhat extreme point of view. I note that the Mongodb guys added
journaling for single server reliability a while ago - an admission that
while in *theory* lots of semi-reliable nodes can be eventually
consistent, it is a lot less hassle if individual nodes are as reliable
as possible. That is what this discussion is about.

Regards

Mark



Re: New server setup

From
David Rees
Date:
On Thu, Mar 14, 2013 at 4:37 PM, David Boreham <david_list@boreham.org> wrote:
> You might want to evaluate the performance you can achieve with a single-SSD
> (use several for capacity by all means) before considering a RAID card + SSD
> solution.
> Again I bet it depends on the application but our experience with the older
> Intel 710 series is that their performance out-runs the CPU, at least under
> our PG workload.

How many people are using a single enterprise grade SSD for production
without RAID? I've had a few consumer grade SSDs brick themselves -
but are the enterprise grade SSDs, like the new Intel S3700 which you
can get in sizes up to 800GB, reliable enough to run as a single drive
without RAID1? The performance of one is definitely good enough for
most medium sized workloads without the complexity of a BBU RAID and
multiple spinning disks...

-Dave


Re: New server setup

From
David Boreham
Date:
On 3/20/2013 6:44 PM, David Rees wrote:
> On Thu, Mar 14, 2013 at 4:37 PM, David Boreham <david_list@boreham.org> wrote:
>> You might want to evaluate the performance you can achieve with a single-SSD
>> (use several for capacity by all means) before considering a RAID card + SSD
>> solution.
>> Again I bet it depends on the application but our experience with the older
>> Intel 710 series is that their performance out-runs the CPU, at least under
>> our PG workload.
> How many people are using a single enterprise grade SSD for production
> without RAID? I've had a few consumer grade SSDs brick themselves -
> but are the enterprise grade SSDs, like the new Intel S3700 which you
> can get in sizes up to 800GB, reliable enough to run as a single drive
> without RAID1? The performance of one is definitely good enough for
> most medium sized workloads without the complexity of a BBU RAID and
> multiple spinning disks...
>

You're replying to my post, but I'll raise my hand again :)

We run a bunch of single-socket 1U, short-depth machines (Supermicro
chassis) using 1x Intel 710 drives (we'd use S3700 in new deployments
today). The most recent of these have 128G and E5-2620 hex-core CPU and
dissipate less than 150W at full-load.

Couldn't be happier with the setup. We have 18 months of uptime with no
drive failures, running at several hundred wps 7x24. We also write tens
of GB of log files every day that are rotated, so the drives are getting
beaten up on bulk data overwrites too.

There is of course a non-zero probability of some unpleasant firmware
bug afflicting the drives (as with regular spinning drives), and
initially we deployed a "spare" 10k HD in the chassis, spun-down, that
would allow us to re-jigger the machines without SSD remotely (the data
center is 1000 miles away). We never had to do that, and later
deployments omitted the HD spare. We've also considered mixing SSD from
two vendors for firmware-bug-diversity, but so far we only have one
approved vendor (Intel).















Re: New server setup

From
Karl Denninger
Date:

Two is one, one is none.
:-)

-
-- Karl Denninger
The Market Ticker ®
Cuda Systems LLC

Re: New server setup

From
Scott Marlowe
Date:
On Wed, Mar 20, 2013 at 6:44 PM, David Rees <drees76@gmail.com> wrote:
> On Thu, Mar 14, 2013 at 4:37 PM, David Boreham <david_list@boreham.org> wrote:
>> You might want to evaluate the performance you can achieve with a single-SSD
>> (use several for capacity by all means) before considering a RAID card + SSD
>> solution.
>> Again I bet it depends on the application but our experience with the older
>> Intel 710 series is that their performance out-runs the CPU, at least under
>> our PG workload.
>
> How many people are using a single enterprise grade SSD for production
> without RAID? I've had a few consumer grade SSDs brick themselves -
> but are the enterprise grade SSDs, like the new Intel S3700 which you
> can get in sizes up to 800GB, reliable enough to run as a single drive
> without RAID1? The performance of one is definitely good enough for
> most medium sized workloads without the complexity of a BBU RAID and
> multiple spinning disks...

I would still at least run two in software RAID-1 for reliability.


Re: New server setup

From
Mark Kirkwood
Date:
On 21/03/13 13:44, David Rees wrote:
> On Thu, Mar 14, 2013 at 4:37 PM, David Boreham <david_list@boreham.org> wrote:
>> You might want to evaluate the performance you can achieve with a single-SSD
>> (use several for capacity by all means) before considering a RAID card + SSD
>> solution.
>> Again I bet it depends on the application but our experience with the older
>> Intel 710 series is that their performance out-runs the CPU, at least under
>> our PG workload.
>
> How many people are using a single enterprise grade SSD for production
> without RAID? I've had a few consumer grade SSDs brick themselves -
> but are the enterprise grade SSDs, like the new Intel S3700 which you
> can get in sizes up to 800GB, reliable enough to run as a single drive
> without RAID1? The performance of one is definitely good enough for
> most medium sized workloads without the complexity of a BBU RAID and
> multiple spinning disks...
>

If you are using Intel S3700 or 710's you can certainly use a pair setup
in software RAID1 (so avoiding the need for RAID cards and BBU etc).
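
A sketch of that pairing with Linux software RAID (device names are placeholders):

# Mirror the two SSDs with md and put the PostgreSQL data directory on the array.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc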

I'd certainly feel happier with 2 drives :-) . However, a setup using
replication with a number of hosts - each with a single SSD is going to
be ok.

Regards

Mark