Thread: Deploying PostgreSQL on CentOS with SSD and Hardware RAID
Hello.

We're intending to deploy PostgreSQL on Linux with SSD drives which would be in a RAID 1 configuration with Hardware RAID.

My first question is essentially: are there any issues we need to be aware of when running PostgreSQL 9 on CentOS 6 on a server with SSD drives in a Hardware RAID 1 configuration? Will there be any compatibility problems (seems unlikely)? Should we consider alternative configurations as being more effective for getting better use out of the hardware?

The second question is: are there any SSD-specific issues to be aware of when tuning PostgreSQL to make the best use of this hardware and software?

The specific hardware we're planning to use is the HP ProLiant DL360 Gen8 server with P420i RAID controller, and two MLC SSDs in RAID 1 for the OS, and two SLC SSDs in RAID 1 for the database - but I guess it isn't necessary to have used this specific hardware setup in order to have experience with these general issues. The P420i controller appears to be compatible with recent versions of CentOS, so drivers should not be a concern (hopefully).

Any insights anyone can offer on these issues would be most welcome.

Regards,

Matt.
On Fri, May 10, 2013 at 9:14 AM, Matt Brock <mb@mattbrock.co.uk> wrote:
> We're intending to deploy PostgreSQL on Linux with SSD drives which would be in a RAID 1 configuration with Hardware RAID.
> [...]
> The specific hardware we're planning to use is the HP ProLiant DL360 Gen8 server with P420i RAID controller, and two MLC SSDs in RAID 1 for the OS, and two SLC SSDs in RAID 1 for the database [...]

The specific drive models have a huge impact on SSD performance. In fact, your use of SLC drives suggests you might be using antiquated (by SSD standards) hardware. All the latest action is in MLC now (see here: http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html).

merlin
After googling this for a while, it seems that High Endurance MLC is only starting to rival SLC for endurance and write performance in the very latest, cutting-edge hardware. In general, though, it seems it would be fair to say that SLCs are still a better bet for databases than MLC?

The number and capacity of drives is small in this instance, and the price difference between the two for HP SSDs isn't very wide, so cost isn't really an issue. We just want to use whichever is better for the database.

On 10 May 2013, at 15:20, Merlin Moncure <mmoncure@gmail.com> wrote:
> The specific drive models have a huge impact on SSD performance. In fact, your use of SLC drives suggests you might be using antiquated (by SSD standards) hardware. All the latest action is in MLC now (see here: http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html).

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 5/10/2013 9:19 AM, Matt Brock wrote:
> After googling this for a while, it seems that High Endurance MLC is only starting to rival SLC for endurance and write performance in the very latest, cutting-edge hardware. In general, though, it seems it would be fair to say that SLCs are still a better bet for databases than MLC?

I've never looked at SLC drives in the past few years and don't know anyone who uses them these days.

> The number and capacity of drives is small in this instance, and the price difference between the two for HP SSDs isn't very wide, so cost isn't really an issue. We just want to use whichever is better for the database.

Could you post some specific drive models please? HP probably doesn't make the drives, and it really helps to know what devices you're using since they are not nearly as generic in behavior and features as magnetic drives.
Not sure of your space requirements, but I'd think a RAID 10 of 8x or more Samsung 840 Pro 256/512 GB would be the best value. Using a simple mirror won't get you the reliability that you want since heavy writing will burn the drives out over time, and if you're writing the exact same content to both drives, they could likely fail at the same time. Regardless of the underlying hardware you should still follow best practices for provisioning disks, and raid 10 is the way to go. I don't know what your budget is though. Anyway, mirrored SSD will probably work fine, but I'd avoid using just two drives for the reasons above. I'd suggest at least testing RAID 5 or something else to spread the load around. Personally, I think the ideal configuration would be a RAID 10 of at least 8 disks plus 1 hot spare. The Samsung 840 Pro 256 GB are frequently $200 on sale at Newegg. YMMV but they are amazing drives.
On Fri, May 10, 2013 at 10:19 AM, Matt Brock <mb@mattbrock.co.uk> wrote:
> After googling this for a while, it seems that High Endurance MLC is only starting to rival SLC for endurance and write performance in the very latest, cutting-edge hardware. In general, though, it seems it would be fair to say that SLCs are still a better bet for databases than MLC?
>
> The number and capacity of drives is small in this instance, and the price difference between the two for HP SSDs isn't very wide, so cost isn't really an issue. We just want to use whichever is better for the database.

Well, it's more complicated than that. While SLC drives were indeed inherently faster and had longer lifespans, all flash drives basically have the requirement of having to carefully manage writes in order to get good performance. Unfortunately, this means that for database use the drives must have some type of non-volatile cache and/or sufficient backup juice in a capacitor to spin out pending writes in the event of sudden loss of power. Many drives, including (famously) the so-called Intel X25-E "enterprise" line, did not do this and were therefore unsuitable for database use.

As it turns out, the list of flash drives that are suitable for database use is surprisingly small. The s3700 I noted upthread seems to be specifically built with databases in mind and is likely the best choice for new deployments. The older Intel 320 is also a good choice. I think that's pretty much it until you get into expensive pci-e based gear. There might be some non-Intel drives out there that are suitable, but be very, very careful and triple-verify that the drive has an on-board capacitor and has gotten real traction in enterprise database usage.

merlin
On Fri, May 10, 2013 at 11:11 AM, Evan D. Hoffman <evandhoffman@gmail.com> wrote:
> Not sure of your space requirements, but I'd think a RAID 10 of 8x or more Samsung 840 Pro 256/512 GB would be the best value. [...] The Samsung 840 Pro 256 GB are frequently $200 on sale at Newegg. YMMV but they are amazing drives.

The Samsung 840 has no power loss protection and is therefore useless for database use IMO unless you don't care about data safety and/or are implementing redundancy via some other method (say, by synchronous replication).

merlin
I'd expect to use a RAID controller with either BBU or NVRAM cache to handle that, and that the server itself would be on UPS for a production DB. That said, a standby replica DB on conventional disk is definitely a good idea in any case.
On Fri, May 10, 2013 at 11:34 AM, Evan D. Hoffman <evandhoffman@gmail.com> wrote:
> I'd expect to use a RAID controller with either BBU or NVRAM cache to handle that, and that the server itself would be on UPS for a production DB. That said, a standby replica DB on conventional disk is definitely a good idea in any case.

Sadly, NVRAM cache doesn't help (unless the raid controller is managing drive writes down to the flash level, and no such products exist that I am aware of). The problem is that to provide guarantees, the raid controller still needs to be able to tell the device to flush down to physical storage. While flash drives can be configured to do that (basically write-through mode), it's pretty silly to do so as it will ruin performance and quickly destroy the drive.

Trusting a UPS is up to you, but if your UPS fails, someone knocks the power cable, etc., you have data loss. With an on-drive capacitor you only get data loss via physical damage or corruption on the drive.

merlin
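[Editor's note: one way to sanity-check what a given drive/controller stack really does on flush is to measure synchronous-write throughput directly. PostgreSQL ships a contrib tool, pg_test_fsync, for exactly this; as a rougher stand-in, plain GNU dd with O_DSYNC gives a ballpark figure. A sketch (file path is arbitrary, not from the thread); if the reported rate is implausibly high for the medium, something in the stack is almost certainly absorbing flushes in volatile cache:]

```shell
# Crude probe of real synchronous-write behavior: write 8 kB blocks
# (PostgreSQL's page size) and force each one to stable storage with
# O_DSYNC. dd prints the sustained rate on stderr when it finishes.
# Tens of thousands of synced writes/sec on a consumer SSD is a strong
# hint that a volatile cache is acknowledging the flushes.
dd if=/dev/zero of=./fsync_probe.tmp bs=8k count=200 oflag=dsync
rm -f ./fsync_probe.tmp
```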
On 5/10/2013 10:21 AM, Merlin Moncure wrote:
> As it turns out, the list of flash drives that are suitable for database use is surprisingly small. The s3700 I noted upthread seems to be specifically built with databases in mind and is likely the best choice for new deployments. The older Intel 320 is also a good choice. I think that's pretty much it until you get into expensive pci-e based gear.

This may have been a typo: did you mean the Intel 710 series rather than the 320?

While the 320 has the supercap, it isn't specified for high write endurance. Definitely usable for a database, and a better choice than most of the alternatives, but I'd have listed the 710 ahead of the 320.
On Fri, May 10, 2013 at 12:03 PM, David Boreham <david_list@boreham.org> wrote:
> This may have been a typo: did you mean the Intel 710 series rather than the 320?
>
> While the 320 has the supercap, it isn't specified for high write endurance. Definitely usable for a database, and a better choice than most of the alternatives, but I'd have listed the 710 ahead of the 320.

It wasn't a typo. The 320 is perfectly fine although it will wear out faster -- so it fills a niche for low write intensity applications. I find the s3700 to be superior to the 710 in just about every way (although you're right -- it is suitable for database use).

merlin
On 05/10/2013 12:46 PM, Merlin Moncure wrote:
> Trusting a UPS is up to you, but if your UPS fails, someone knocks the power cable, etc., you have data loss. With an on-drive capacitor you only get data loss via physical damage or corruption on the drive.
>
> merlin

Well, we have dual redundant power supplies on separate UPSes, so could something go wrong? Yes, but a tornado could come along and destroy the building also.

--
Stephen Clark
*NetWolves*
Director of Technology
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark@netwolves.com
http://www.netwolves.com
On Fri, May 10, 2013 at 10:20 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> It wasn't a typo. The 320 is perfectly fine although it will wear out faster -- so it fills a niche for low write intensity applications. I find the s3700 to be superior to the 710 in just about every way (although you're right -- it is suitable for database use).

There's also the 520 series, which has better performance than the 320 series (which is EOL now).
On 5/10/2013 11:20 AM, Merlin Moncure wrote:
> I find the s3700 to be superior to the 710 in just about every way (although you're right -- it is suitable for database use).

The s3700 series replaces the 710, so it should be superior :)
On 5/10/2013 11:23 AM, Lonni J Friedman wrote:
> There's also the 520 series, which has better performance than the 320 series (which is EOL now).

I wouldn't use the 520 series for production database storage -- it has the Sandforce controller and apparently no power failure protection.
Steve Clark wrote:
> Well we have dual redundant power supplies on separate UPSes, so could something go wrong? Yes, but a tornado could come along and destroy the building also.

... hence your standby server across the country?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On May 10, 2013, at 7:14 AM, Matt Brock <mb@mattbrock.co.uk> wrote:
> We're intending to deploy PostgreSQL on Linux with SSD drives which would be in a RAID 1 configuration with Hardware RAID.
> [...]
> The second question is: are there any SSD-specific issues to be aware of when tuning PostgreSQL to make the best use of this hardware and software?

A couple of things I noticed with a similar-ish setup:

* Some forms of RAID / LVM break the kernel's automatic disk tuning mechanism. In particular, there is a "rotational" tunable that often does not get set right. You might end up tweaking readahead and friends as well.
http://www.mjmwired.net/kernel/Documentation/block/queue-sysfs.txt#112

* The default Postgres configuration is awful for an SSD-backed database. You really need to futz with checkpoints to get acceptable throughput. The "PostgreSQL 9.0 High Performance" book is fantastic and is what I used to great success.

* The default Linux virtual memory configuration is awful for this configuration. Briefly, it will accept a ton of incoming data, and then go through an awful stall as soon as it calls fsync() to write all that data to disk. We had multi-second delays all the way through to the application because of this. We had to change the zone_reclaim_mode and the dirty buffer limits.
http://www.postgresql.org/message-id/500616CB.3070408@2ndQuadrant.com

I am not sure that these numbers will end up being anywhere near what works for you, but these are my notes from tuning a 4x MLC SSD RAID 10. I haven't proven that this is optimal, but it was way better than the defaults. We ended up with the following list of changes:

* Change IO scheduler to "noop"
* Mount DB volume with nobarrier, noatime
* Turn blockdev readahead to 16MiB
* Turn sdb's "rotational" tunable to 0

PostgreSQL configuration changes:

synchronous_commit = off
effective_io_concurrency = 4
checkpoint_segments = 1024
checkpoint_timeout = 10min
checkpoint_warning = 8min
shared_buffers = 32gb
temp_buffers = 128mb
work_mem = 512mb
maintenance_work_mem = 1gb

Linux sysctls:

vm.swappiness = 0
vm.zone_reclaim_mode = 0
vm.dirty_bytes = 134217728
vm.dirty_background_bytes = 1048576

Hope that helps,
Steven
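[Editor's note: for anyone wanting to reproduce the OS-side items in a list like the one above, they map onto sysfs, blockdev, and sysctl roughly as follows. This is a sketch only: the device name (sdb), mount point (/mnt/pgdata), and values come from the notes above and will differ on other systems; nobarrier is only sane behind a non-volatile write cache; and the DRYRUN guard is an addition for safe previewing, not part of the thread.]

```shell
# Sketch of the OS-level changes above. Device (sdb) and mount point are
# assumptions -- adjust for your hardware. With DRYRUN unset it defaults
# to "echo" and only prints the commands; run as root with DRYRUN= (empty)
# to actually apply them. These do not survive reboot; persist them via
# rc.local, udev rules, or sysctl.conf as appropriate.
DRYRUN=${DRYRUN-echo}
$DRYRUN sh -c 'echo noop > /sys/block/sdb/queue/scheduler'    # no request reordering for SSD
$DRYRUN sh -c 'echo 0 > /sys/block/sdb/queue/rotational'      # mark the device non-rotational
$DRYRUN blockdev --setra 32768 /dev/sdb                       # 32768 x 512 B sectors = 16 MiB readahead
$DRYRUN mount -o remount,nobarrier,noatime /mnt/pgdata        # only safe with non-volatile write cache!
$DRYRUN sysctl -w vm.swappiness=0 vm.zone_reclaim_mode=0
$DRYRUN sysctl -w vm.dirty_bytes=134217728 vm.dirty_background_bytes=1048576
```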
On Fri, May 10, 2013 at 11:23 AM, Steven Schlansker <steven@likeness.com> wrote:
> I am not sure that these numbers will end up being anywhere near what works for you, but these are my notes from tuning a 4x MLC SSD RAID 10. I haven't proven that this is optimal, but it was way better than the defaults.

Can you provide more details about your setup, including:
* What kind of filesystem are you using?
* Linux distro and/or kernel version
* hardware (RAM, CPU cores etc)
* database usage patterns (% writes, growth, etc)

thanks
On Fri, May 10, 2013 at 1:23 PM, Steven Schlansker <steven@likeness.com> wrote:
> PostgreSQL configuration changes:
> synchronous_commit = off
> [...]

That's good info, but it should be noted that synchronous_commit trades a risk of some data loss (but not nearly as much risk as volatile storage) for a big increase in commit performance.

merlin
On May 10, 2013, at 11:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> that's good info, but it should be noted that synchronous_commit trades a risk of some data loss (but not nearly as much risk as volatile storage) for a big increase in commit performance.

Yes, that is a choice we consciously made. If our DB server crashes, losing the last few ms worth of transactions is an acceptable loss to us. But that may not be OK for everyone :-)
On May 10, 2013, at 11:35 AM, Lonni J Friedman <netllama@gmail.com> wrote:
> Can you provide more details about your setup, including:
> * What kind of filesystem are you using?
> * Linux distro and/or kernel version
> * hardware (RAM, CPU cores etc)
> * database usage patterns (% writes, growth, etc)

Yes, as long as you promise not to just use my configuration without doing proper testing on your own system, even if it seems similar!

Linux version 2.6.32.225 (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #2 SMP Thu Mar 29 16:43:20 EDT 2012
DMI: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
CPU0: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz stepping 02
Total of 24 processors activated (140796.98 BogoMIPS). (2 sockets x 2 hyperthreads x 6 cores)
96GB ECC RAM
Filesystem is ext4 on LVM on hardware RAID 1+0, Adaptec 5405

The database is very much read heavy, but there is a base load of writes and bursts of much larger writes. I don't have specifics regarding how it breaks down. The database is about 400GB and is growing moderately, maybe a few GB/day. More of the write traffic is re-writes rather than writes.

Hope that helps,
Steven
On 10 May 2013, at 16:25, David Boreham <david_list@boreham.org> wrote:
> I've never looked at SLC drives in the past few years and don't know anyone who uses them these days.

Because SLCs are still more expensive? Because MLCs are now almost as good as SLCs for performance/endurance?

I should point out that this database will be the backend for a high-transaction gaming site with very heavy database usage including a lot of writes. Disk IO on the database server has always been our bottleneck so far.

Also, the database is kept comparatively very small - about 25 GB currently, and it will grow to perhaps 50 GB this year as a result of new content and traffic coming in.

So whilst MLCs might be almost as good as SLCs now, the price difference for us is so insignificant that if we can still get a small improvement with SLCs then we might as well do so.

> Could you post some specific drive models please? HP probably doesn't make the drives, and it really helps to know what devices you're using since they are not nearly as generic in behavior and features as magnetic drives.

I've asked our HP dealer for this information since unfortunately it doesn't appear to be available on the HP website - hopefully it will be forthcoming at some point.

Matt.
On 05/10/2013 11:38 AM, Merlin Moncure wrote:
>> PostgreSQL configuration changes:
>> synchronous_commit = off
>> effective_io_concurrency = 4
>> checkpoint_segments = 1024
>> checkpoint_timeout = 10min
>> checkpoint_warning = 8min
>> shared_buffers = 32gb
>> temp_buffers = 128mb
>> work_mem = 512mb
>> maintenance_work_mem = 1gb
>
> that's good info, but it should be noted that synchronous_commit
> trades a risk of some data loss (but not nearly as much risk as
> volatile storage) for a big increase in commit performance.

Yeah, but it is an extremely low risk - probably lower than, say, a bad Apache form submission. Generally the database is the most reliable hardware in the cluster. It is also not a risk of corruption, which a lot of people confuse it with.

One thing I would note is that work_mem is very high. That might be alright in an SSD environment, because even if a sort spills to disk it is still going to be fast, but it is something to consider. Another thing: why such a low checkpoint_timeout? Set it to 60 minutes and be done with it. The bgwriter should be dealing with these problems.

Sincerely,

JD
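As a middle ground on the synchronous_commit trade-off (my sketch, not something proposed in the thread), the setting can be relaxed per transaction rather than cluster-wide, so only sessions that can tolerate losing the last few commits on a crash take the risk. The table name here is hypothetical:

```sql
-- Keep synchronous_commit = on in postgresql.conf, then relax it only
-- where losing the most recent commits after a crash is acceptable:
BEGIN;
SET LOCAL synchronous_commit = off;  -- applies to this transaction only
INSERT INTO game_events (payload) VALUES ('...');   -- hypothetical table
COMMIT;  -- returns before the WAL flush; risks loss, never corruption
```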
On 5/11/2013 3:10 AM, Matt Brock wrote:
> On 10 May 2013, at 16:25, David Boreham <david_list@boreham.org> wrote:
>> I've never looked at SLC drives in the past few years and don't know
>> anyone who uses them these days.
> Because SLCs are still more expensive? Because MLCs are now almost as
> good as SLCs for performance/endurance?

Not quite. More like: a) I don't know where to buy SLC drives in 2013 (all the drives for sale on newegg.com, for example, are MLC), and b) today's MLC drives are quite good enough for me (and I'd venture to say for any database-related purpose).

> I should point out that this database will be the backend for a
> high-transaction gaming site with very heavy database usage including a
> lot of writes. Disk IO on the database server has always been our
> bottleneck so far.

Sure, same here. I wouldn't be replying if all I did was run an SSD drive in my laptop ;)

> Also, the database is kept comparatively very small - about 25 GB
> currently, and it will grow to perhaps 50 GB this year as a result of
> new content and traffic coming in.

Our clustering partitions the user population between servers such that each server has a database about the same size as yours. We tend to use 200 or 300 GB drives on each box, allowing plenty of space for the DB, a copy of another server's DB, and log files.

> I've asked our HP dealer for this information since unfortunately it
> doesn't appear to be available on the HP website - hopefully it will be
> forthcoming at some point.

This is a bit of a red flag for me.
During the qualification process for our SSD drives we: read the technical papers from Intel; ran lab tests where we saturated a drive with writes for weeks, checking the write-endurance SMART data and operation latency; modified smartmontools so it could read all the useful drive counters, and also reset the wear-estimation counters; performed power-cable pull tests; and read everything posted on this list by people who had done serious testing, in addition to the tests we ran in-house. I'm not sure I'd want to deploy "Joe Random SSD du jour" that HP decided to ship me. You might consider buying boxes sans drives and fitting your own, of a known, trusted type.
On 5/12/2013 6:13 PM, David Boreham wrote:
> Not quite. More like: a) I don't know where to buy SLC drives in 2013
> (all the drives for example for sale on newegg.com are MLC) and b)
> today's MLC drives are quite good enough for me (and I'd venture to
> say any database-related purpose).

Newegg wouldn't know 'enterprise' if it bit them. They just sell mass-market consumer stuff and gamer kit.

The real SLC drives end up OEM-branded in large SAN systems, such as those sold by NetApp and EMC, and are made by companies like STEC that have zero presence in 'whitebox' resale markets like Newegg.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast
btw, we deploy on CentOS 6. The only things we change from the defaults are:

1. Add the "relatime,discard" options to the mount (check whether the most recent CentOS 6 does this itself -- it didn't back when we first deployed on 6.0).

2. Disable swap. This isn't strictly an SSD tweak, since we have enough physical memory not to need swap, but it is a useful measure for us since the default install always creates a swap partition, which a) uses valuable space on the smaller-sized SSDs, and b) if there are ever writes to the swap partition, it would be bad for wear on the entire drive.

We also set up monitoring of the drives' SMART wear counter to ensure early warning of any drive coming close to wear-out. We do not use (and don't like) RAID with SSDs.
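A rough sketch of those tweaks plus the wear-counter check (the device name, mount point, and SMART attribute name are assumptions -- the wear attribute varies by vendor, e.g. Media_Wearout_Indicator on Intel drives):

```shell
# /etc/fstab entry for the database volume (mount point is hypothetical):
#   /dev/sda2  /var/lib/pgsql  ext4  relatime,discard  0 2

# Disable swap now and keep it disabled across reboots:
swapoff -a                              # turn off any active swap
sed -i '/\sswap\s/ s/^/#/' /etc/fstab   # comment out swap entries

# Check the drive's wear indicator (attribute name varies by vendor):
smartctl -A /dev/sda | grep -i -E 'wear|media'
```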
On 5/12/2013 7:20 PM, John R Pierce wrote: > > the real SLC drives end up OEM branded in large SAN systems, such as > sold by Netapp, EMC, and are made by companies like STEC that have > zero presence in the 'whitebox' resale markets like Newegg. > Agreed. I don't go near the likes of Simple, HGST, F-IO, SMART, et al. For me this is SAS and SCSI re-born -- an excuse to charge very high prices for a product not significantly different from a much cheaper mainstream alternative, by exploiting unsophisticated purchasers with tales of enterprise snake oil. ymmv, but since I am spending my own $$, I'll stick to product I can order from the likes of Newegg and Amazon.
On 5/12/2013 6:41 PM, David Boreham wrote:
> Agreed. I don't go near the likes of Simple, HGST, F-IO, SMART, et al.
> For me this is SAS and SCSI re-born -- an excuse to charge very high
> prices for a product not significantly different from a much cheaper
> mainstream alternative, by exploiting unsophisticated purchasers with
> tales of enterprise snake oil.
>
> ymmv, but since I am spending my own $$, I'll stick to product I can
> order from the likes of Newegg and Amazon.

Except those high-end enterprise products ARE write-safe, while most SATA SSDs aren't. The enterprise SSDs like STEC are engineered for much higher write-cycle lifetimes. The whole storage infrastructure has several more 9's on its reliability, and those extra 9's don't come cheap. And if [Bigname Vendor] sells you a system with storage controller X and drives Y, you can generally assume they've been tested together. They are also selling their service and support, for better or worse.

SAS has better IO concurrency than SATA, and SAS drives are dual-path, which means you can have redundant storage channels and cabling even if you're doing JBOD.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast
On Sun, May 12, 2013 at 8:20 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 5/12/2013 6:13 PM, David Boreham wrote:
>> Not quite. More like: a) I don't know where to buy SLC drives in 2013
>> (all the drives for example for sale on newegg.com are MLC) and b)
>> today's MLC drives are quite good enough for me (and I'd venture to
>> say any database-related purpose).
>
> Newegg wouldn't know 'enterprise' if it bit them. they just sell mass
> market consumer stuff and gamer kit.
>
> the real SLC drives end up OEM branded in large SAN systems, such as
> sold by Netapp, EMC, and are made by companies like STEC that have zero
> presence in the 'whitebox' resale markets like Newegg.

The industry decided a while back that MLC was basically the way to go in terms of cost and engineering trade-offs, at least in cases where you needed a lot of storage. Yes, you can still get SLC in mid-tier and up storage, but:

*) A lot of these drives are simply re-branded Intel etc.
*) When it comes to SSDs, I have zero confidence in vendor-provided hardware specs (lifetime, iops, etc.). The lack of third-party test coverage and performance benchmarking is a big problem for me. Ever bought a SAN and had it not do what it was supposed to?
*) The faster-moving white-box market has chosen MLC. Three years back, the jury was still out. This suggests to me that SAN vendors are still behind the curve in terms of SSDs, which is typical of enterprise storage vendors. But,
*) In many cases, the performance of the latest MLC drives is so fast that many applications that would have needed to scale up to high-end storage no longer need to do so. A software RAID of, say, four S3700 drives would probably outperform most <$100k SANs from a couple of years back.

merlin
So, a week after asking our HP dealer, they've finally replied to say that they can't tell us what manufacturer and model the SSDs are because "HP treat this information as company confidential". Not particularly helpful.

They have at least confirmed that the drives have "surprise power loss protection" and "tools to present information on the percentage of life used and amount of life remaining under the workload-to-date".

Given that these are enterprise-class drives, given that they have the high-availability features that we would need in database servers, and given that the deadline on this project is very tight so I don't really have time to do any testing on third-party drives, I'm guessing we'll go with the HP drives, even though they most likely are a little behind the times. Whilst we will perhaps lose a little bit of performance compared to the latest Intel drives, we will gain in terms of high-availability reassurance and simplicity of deployment, which is crucial for this project given its tight deadline. However, after going through all the advice on this thread and having had time to think, I'll probably go for a four-disk RAID 10 array with SLCs, rather than a two-disk RAID 1 array with MLCs (for the OS) and a two-disk RAID 1 array with SLCs (for the database).

If I had more time and resources for testing I would likely end up going a different route, however.

Many thanks to all who've contributed their thoughts and opinions - much appreciated.

Matt.
On 11/05/13 02:25, Merlin Moncure wrote: > On Fri, May 10, 2013 at 11:11 AM, Evan D. Hoffman > <evandhoffman@gmail.com> wrote: >> Not sure of your space requirements, but I'd think a RAID 10 of 8x or more >> Samsung 840 Pro 256/512 GB would be the best value. Using a simple mirror >> won't get you the reliability that you want since heavy writing will burn >> the drives out over time, and if you're writing the exact same content to >> both drives, they could likely fail at the same time. Regardless of the >> underlying hardware you should still follow best practices for provisioning >> disks, and raid 10 is the way to go. I don't know what your budget is >> though. Anyway, mirrored SSD will probably work fine, but I'd avoid using >> just two drives for the reasons above. I'd suggest at least testing RAID 5 >> or something else to spread the load around. Personally, I think the ideal >> configuration would be a RAID 10 of at least 8 disks plus 1 hot spare. The >> Samsung 840 Pro 256 GB are frequently $200 on sale at Newegg. YMMV but they >> are amazing drives. > > Samsung 840 has no power loss protection and is therefore useless for > database use IMO unless you don't care about data safety and/or are > implementing redundancy via some other method (say, by synchronous > replication). I believe the original poster was referring to the "840 Pro" model; that model does include a "supercap" for power loss protection. -Toby
On 13/05/13 11:23, David Boreham wrote:
> btw we deploy on CentOS6. The only things we change from the default are:
>
> 1. add "relatime,discard" options to the mount (check whether the most
> recent CentOS6 does this itself -- it didn't back when we first deployed
> on 6.0).

While it is important to let the SSD know about space that can be reclaimed, I gather the operation does not perform well. I *think* the current advice is to leave 'discard' off the mount options, and instead run a nightly cron job to call 'fstrim' on the mount point. (In really high-write situations, you'd be looking at calling that every hour instead, I suppose.)

I have to admit to having just gone with the advice, rather than benchmarking it thoroughly.

tjc
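The nightly-fstrim approach described above might look something like this (the script path, mount point, and schedule are assumptions on my part):

```shell
#!/bin/sh
# /etc/cron.daily/fstrim -- trim the database filesystem once a night.
# Move it to /etc/cron.hourly under very heavy write loads, as suggested.
# The mount point below is hypothetical; -v reports how much was trimmed.
fstrim -v /var/lib/pgsql
```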
On 5/19/2013 7:19 PM, Toby Corkindale wrote: > On 13/05/13 11:23, David Boreham wrote: >> btw we deploy on CentOS6. The only things we change from the default >> are: >> >> 1. add "relatime,discard" options to the mount (check whether the most >> recent CentOS6 does this itself -- it didn't back when we first deployed >> on 6.0). > > > While it is important to let the SSD know about space that can be > reclaimed, I gather the operation does not perform well. > I *think* current advice is to leave 'discard' off the mount options, > and instead run a nightly cron job to call 'fstrim' on the mount point > instead. (In really high write situations, you'd be looking at calling > that every hour instead I suppose) > > I have to admit to have just gone with the advice, rather than > benchmarking it thoroughly. The guy who blogged about this a couple of years ago was using a Sandforce controller drive. I'm not sure there is a similar issue with other drives. Certainly we've never noticed a problematic delay in file deletes. That said, our applications don't delete files too often (log file purging is probably the only place it happens regularly). Personally, in the absence of a clear and present issue, I'd prefer to go the "kernel guys and drive firmware guys will take care of this" route, and just enable discard on the mount.
On Sun, May 19, 2013 at 8:07 PM, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote: > On 11/05/13 02:25, Merlin Moncure wrote: >> >> On Fri, May 10, 2013 at 11:11 AM, Evan D. Hoffman >> <evandhoffman@gmail.com> wrote: >>> >>> Not sure of your space requirements, but I'd think a RAID 10 of 8x or >>> more >>> Samsung 840 Pro 256/512 GB would be the best value. Using a simple >>> mirror >>> won't get you the reliability that you want since heavy writing will burn >>> the drives out over time, and if you're writing the exact same content to >>> both drives, they could likely fail at the same time. Regardless of the >>> underlying hardware you should still follow best practices for >>> provisioning >>> disks, and raid 10 is the way to go. I don't know what your budget is >>> though. Anyway, mirrored SSD will probably work fine, but I'd avoid >>> using >>> just two drives for the reasons above. I'd suggest at least testing RAID >>> 5 >>> or something else to spread the load around. Personally, I think the >>> ideal >>> configuration would be a RAID 10 of at least 8 disks plus 1 hot spare. >>> The >>> Samsung 840 Pro 256 GB are frequently $200 on sale at Newegg. YMMV but >>> they >>> are amazing drives. >> >> >> Samsung 840 has no power loss protection and is therefore useless for >> database use IMO unless you don't care about data safety and/or are >> implementing redundancy via some other method (say, by synchronous >> replication). > > > > I believe the original poster was referring to the "840 Pro" model; that > model does include a "supercap" for power loss protection. got a source for that? I couldn't verify that after some googling. merlin
On 20/05/13 15:12, David Boreham wrote: > On 5/19/2013 7:19 PM, Toby Corkindale wrote: >> On 13/05/13 11:23, David Boreham wrote: >>> btw we deploy on CentOS6. The only things we change from the default >>> are: >>> >>> 1. add "relatime,discard" options to the mount (check whether the most >>> recent CentOS6 does this itself -- it didn't back when we first deployed >>> on 6.0). >> >> >> While it is important to let the SSD know about space that can be >> reclaimed, I gather the operation does not perform well. >> I *think* current advice is to leave 'discard' off the mount options, >> and instead run a nightly cron job to call 'fstrim' on the mount point >> instead. (In really high write situations, you'd be looking at calling >> that every hour instead I suppose) >> >> I have to admit to have just gone with the advice, rather than >> benchmarking it thoroughly. > > > The guy who blogged about this a couple of years ago was using a > Sandforce controller drive. > I'm not sure there is a similar issue with other drives. Certainly we've > never noticed a problematic delay in file deletes. > That said, our applications don't delete files too often (log file > purging is probably the only place it happens regularly). > > Personally, in the absence of a clear and present issue, I'd prefer to > go the "kernel guys and drive firmware guys will take care of this" > route, and just enable discard on the mount. This guy posted about a number of SSD drives, and enabling discard affected most of them quite negatively: http://people.redhat.com/lczerner/discard/ext4_discard.html http://people.redhat.com/lczerner/discard/files/Performance_evaluation_of_Linux_DIscard_support_Dev_Con2011_Brno.pdf That is from 2011 though, so you're right that things may have improved by now.. Has anyone seen benchmarks supporting that though? Toby
On 21/05/13 00:16, Merlin Moncure wrote: > On Sun, May 19, 2013 at 8:07 PM, Toby Corkindale > <toby.corkindale@strategicdata.com.au> wrote: >> On 11/05/13 02:25, Merlin Moncure wrote: >>> >>> On Fri, May 10, 2013 at 11:11 AM, Evan D. Hoffman >>> <evandhoffman@gmail.com> wrote: >>>> >>>> Not sure of your space requirements, but I'd think a RAID 10 of 8x or >>>> more >>>> Samsung 840 Pro 256/512 GB would be the best value. Using a simple >>>> mirror >>>> won't get you the reliability that you want since heavy writing will burn >>>> the drives out over time, and if you're writing the exact same content to >>>> both drives, they could likely fail at the same time. Regardless of the >>>> underlying hardware you should still follow best practices for >>>> provisioning >>>> disks, and raid 10 is the way to go. I don't know what your budget is >>>> though. Anyway, mirrored SSD will probably work fine, but I'd avoid >>>> using >>>> just two drives for the reasons above. I'd suggest at least testing RAID >>>> 5 >>>> or something else to spread the load around. Personally, I think the >>>> ideal >>>> configuration would be a RAID 10 of at least 8 disks plus 1 hot spare. >>>> The >>>> Samsung 840 Pro 256 GB are frequently $200 on sale at Newegg. YMMV but >>>> they >>>> are amazing drives. >>> >>> >>> Samsung 840 has no power loss protection and is therefore useless for >>> database use IMO unless you don't care about data safety and/or are >>> implementing redundancy via some other method (say, by synchronous >>> replication). >> >> >> >> I believe the original poster was referring to the "840 Pro" model; that >> model does include a "supercap" for power loss protection. > > got a source for that? I couldn't verify that after some googling. I'm sorry, I really thought they had made it onto my list of candidates that included supercaps.. now I'm checking again, I can't find any evidence to support that claim either. I must have confused them in my mind with another drive. 
Sorry about that, and thanks for checking. -Toby
On Tue, 21 May 2013 11:40:55 +1000, Toby Corkindale wrote:
>>> While it is important to let the SSD know about space that can be
>>> reclaimed, I gather the operation does not perform well. I *think*
>>> current advice is to leave 'discard' off the mount options, and instead
>>> run a nightly cron job to call 'fstrim' on the mount point instead. (In
>>> really high write situations, you'd be looking at calling that every
>>> hour instead I suppose)

This is still a good idea - see below.

>> The guy who blogged about this a couple of years ago was using a
>> Sandforce controller drive.

Btw, that doesn't mean anything (in terms of either performance or stability), since "the controller" also needs to be paired with an - often vendor-dependent - firmware, which is much more relevant. Since LSI acquired Sandforce this situation has gotten much better (unified upstream).

>> I'm not sure there is a similar issue with other drives.

There is (now), because..

>> Certainly we've never noticed a problematic delay in file deletes. That
>> said, our applications don't delete files too often (log file purging
>> is probably the only place it happens regularly).
>>
>> Personally, in the absence of a clear and present issue, I'd prefer to
>> go the "kernel guys and drive firmware guys will take care of this"
>> route, and just enable discard on the mount.

Nope, wrong, because.. (..getting there :)

> That is from 2011 though, so you're right that things may have improved
> by now.. Has anyone seen benchmarks supporting that though?

Unfortunately, since kernel 3.8 discards are issued as synchronous commands, effectively disabling any scheduling/merging etc. The result can be seen easily:

- mount drive without discard using kernel >= 3.8
- unpack kernel source
- time delete of entire tree
- remount with discard
- unpack kernel tree
- start delete of tree
- ...
- check it hasn't crashed
- ...
- go plant a tree or make babies while waiting for it to finish

Online discard has gotten so slow that it's now a good idea to turn it off for anything but light write workloads. Metadata-heavy writes are obviously the worst case. I experienced this on Samsung, Intel and Sandforce-based drives, so "the controller" is no longer the primary reason for the performance impact. Extremely enterprisey drives *might* behave slightly better, but I doubt it; flash erase cycles are what they are.

-h
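That timing comparison can be scripted roughly like this (the device, mount point, and tarball path are assumptions; run as root on a throwaway filesystem, since it unpacks and deletes a whole kernel tree twice):

```shell
#!/bin/bash
# Compare unlink performance with and without online discard.
# /dev/sdb1, /mnt/ssd and the tarball location are hypothetical.
for opt in "relatime" "relatime,discard"; do
    mount -o "$opt" /dev/sdb1 /mnt/ssd
    tar -C /mnt/ssd -xf /root/linux-3.8.tar.xz   # unpack kernel source
    echo "=== mount options: $opt ==="
    time rm -rf /mnt/ssd/linux-3.8               # the interesting number
    umount /mnt/ssd
done
```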