Thread: New server: SSD/RAID recommendations?
We're buying a new server in the near future to replace an aging system. I'd appreciate advice on the best SSD devices and RAID controller cards available today.

The database is about 750 GB. This is a "warehouse" server. We load supplier catalogs throughout a typical work week, then on the weekend (after Q/A), integrate the new supplier catalogs into our customer-visible "store", which is then copied to a production server where customers see it. So the load is mostly data loading, and essentially no OLTP. Typically there are fewer than a dozen connections to Postgres.

Linux 2.6.32
Postgres 9.3
Hardware:
2 x INTEL WESTMERE 4C XEON 2.40GHZ
12GB DDR3 ECC 1333MHz
3WARE 9650SE-12ML with BBU
12 x 1TB Hitachi 7200RPM SATA disks
  RAID 1 (2 disks): Linux partition, swap partition, pg_xlog partition
  RAID 10 (8 disks): Postgres database partition

We get 5000-7000 TPS from pgbench on this system.

The new system will have at least as many CPUs, and probably a lot more memory (196 GB). The database hasn't reached 1TB yet, but we'd like room to grow, so we'd like a 2TB file system for Postgres. We'll start with the latest versions of Linux and Postgres.

Intel's products have always received good reports in this forum. Is that still the best recommendation? Or are there good alternatives that are price competitive?

What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

Are spinning disks still a good choice for the pg_xlog partition and OS? Is there any reason to get spinning disks at all, or is it better/simpler to just put everything on SSD drives?

Thanks in advance for your advice!
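If it helps put the pgbench figure in context, a stock TPC-B-style run looks something like the sketch below; the scale factor, client count and database name are illustrative, not the exact invocation used here.

  pgbench -i -s 100 bench           # initialize roughly 1.5 GB of test data
  pgbench -c 12 -j 4 -T 600 bench   # about a dozen clients for 10 minutes, reports TPS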
On Wed, Jul 1, 2015 at 5:06 PM, Craig James <cjames@emolecules.com> wrote:
> We're buying a new server in the near future to replace an aging system. I'd
> appreciate advice on the best SSD devices and RAID controller cards
> available today.
> ....
> Linux 2.6.32

Upgrade to an OS with a later kernel, 3.11 at the lowest. 2.6.32 is broken from an IO perspective: it writes 2 to 4x more data than needed for normal operation.

> Postgres 9.3
> ....
> The new system will have at least as many CPUs, and probably a lot more
> memory (196 GB). The database hasn't reached 1TB yet, but we'd like room to
> grow, so we'd like a 2TB file system for Postgres. We'll start with the
> latest versions of Linux and Postgres.

Once your db is bigger than memory, the size of the memory isn't as important as the speed of the IO. Being able to read and write huge swathes of data becomes more important than memory size at that point. Being able to read 100MB/s versus being able to read 1,000MB/s is the difference between 10 minute queries and 10 hour queries on a reporting box.

For sequential throughput, i.e. loading and retrieving with only one or two clients connected, you can throw more and more spinners at it. If you're gonna have enough clients connected to make the array go from sequential to random access, then you want to try and put SSDs in there if possible, but the cost per gig is much higher than spinners. ZFS can use SSDs as cache, as can some newer RAID controllers, which represents a compromise between the two.

If you go with spinners, with or without SSD cache, throw as many at the problem as you can, and run them in RAID-10 if you possibly can. RAID-5 or 6 are much slower, especially on spinners.

> What about a RAID controller? Are RAID controllers even available for
> PCI-Express SSD drives, or do we have to stick with SATA if we need a
> battery-backed RAID controller? Or is software RAID sufficient for SSD
> drives?

Not that I know of. PCI-E drives act as their own drive; you could software RAID them, I guess. Or do you mean are there PCI-E controllers for SATA SSD drives? Plenty of those. Many modern controllers don't use battery-backed cache; they've gone to flash memory, which requires no battery to survive power-down. I like LSI, 3Ware and Areca RAID HBAs.

> Are spinning disks still a good choice for the pg_xlog partition and OS? Is
> there any reason to get spinning disks at all, or is it better/simpler to
> just put everything on SSD drives?

Spinning drives are fine for xlog and OS. If you're logging to the same drive set as pg_xlog is using, you will hit the wall faster. SSDs are great, until you need more space. I'd rather have an 8TB xlog partition of spinners when setting up replication and xlog archiving than a 500GB xlog partition. 8TB sounds like a lot until you need to hold on to a week's worth of xlog files on a busy server.
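Related to the kernel point: on a 3.x or newer kernel it is also worth bounding how much dirty data the page cache can accumulate, so writeback arrives in a steady stream instead of huge bursts. A minimal sketch; the values are illustrative starting points, not tuned recommendations:

  # /etc/sysctl.conf
  vm.dirty_background_bytes = 67108864   # 64 MB: background writeback starts
  vm.dirty_bytes = 536870912             # 512 MB: writers are blocked and forced to flush

  # apply without rebooting
  sysctl -p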
What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?
Quite a few of the benefits of using a hardware RAID controller are irrelevant when using modern SSDs. The great random write performance of the drives means the cache on the controller is less useful and the drives you’re considering (Intel’s enterprise grade) will have full power protection for inflight data.
In my own testing (CentOS 7/Postgres 9.4/128GB RAM/ 8x SSDs RAID5/10/0 with mdadm vs hw controllers) I’ve found that the RAID controller is actually limiting performance compared to just using software RAID. In worst-case workloads I’m able to saturate the controller with 2 SATA drives.
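A quick way to see this on your own hardware is to run the same random-write load against the md device and against the controller's volume and compare IOPS. A rough fio sketch; the target path is a placeholder and the run will overwrite whatever is on it:

  fio --name=randwrite --filename=/dev/md0 --direct=1 --ioengine=libaio \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
      --time_based --group_reporting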
Another advantage in using mdadm is that it’ll properly pass TRIM to the drive. You’ll need to test whether “discard” in your fstab will have a negative impact on performance but being able to run “fstrim” occasionally will definitely help performance in the long run.
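The moving parts for that setup are roughly as follows; device names, filesystem and mount point are examples only:

  # build a RAID-10 array from four SSDs and put a filesystem on it
  mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  mkfs.xfs /dev/md0
  mount -o noatime /dev/md0 /var/lib/pgsql

  # periodic TRIM (e.g. from cron) instead of mounting with "discard"
  fstrim -v /var/lib/pgsql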
If you want another drive to consider you should look at the Micron M500DC. Full power protection for inflight data, same NAND as Intel uses in their drives, good mixed workload performance. (I’m obviously a little biased, though ;-)
Wes Vaske | Senior Storage Solutions Engineer
Micron Technology
101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Andreas Joseph Krogh
Sent: Wednesday, July 01, 2015 6:56 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
On Thursday 02 July 2015 at 01:06:57, Craig James <cjames@emolecules.com> wrote:
> We're buying a new server in the near future to replace an aging system. I'd appreciate advice on the best SSD devices and RAID controller cards available today.
> ....
> Are spinning disks still a good choice for the pg_xlog partition and OS? Is there any reason to get spinning disks at all, or is it better/simpler to just put everything on SSD drives?
> Thanks in advance for your advice!
Depends on your SSD drives, but today's enterprise-grade SSD disks can handle pg_xlog just fine. So I'd go full SSD, unless you have many BLOBs in pg_largeobject; then move that to a separate tablespace on "archive-grade" disks (spinning disks).
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
On Wed, Jul 1, 2015 at 6:06 PM, Craig James <cjames@emolecules.com> wrote:
> We're buying a new server in the near future to replace an aging system. I'd
> appreciate advice on the best SSD devices and RAID controller cards
> available today.
> ....
> Intel's products have always received good reports in this forum. Is that
> still the best recommendation? Or are there good alternatives that are price
> competitive?

In my opinion, the Intel S3500 still has incredible value. Sub 1$/GB and extremely fast. Heavily used both on production systems I manage and my personal workstation. This report: http://lkcl.net/reports/ssd_analysis.html told me everything I needed to know about the drive. If you are sustaining extremely high rates of writing data though, particularly of the random kind, you need to factor in drive lifespan and may want to consider the S3700 or one of its competitors. Both drives have been refreshed into the 3510 and 3710 models, but they are brand new and not highly reviewed yet, so tread carefully.

On my crapbox workstation I get about 5k random writes at large scale factor from a single device. I definitely support software RAID and not picking up a fancy RAID controller, as long as you know your way around mdadm.

Oh, and be sure to crank effective_io_concurrency: http://www.postgresql.org/message-id/CAHyXU0wgpE2E3B+rmZ959tJT_adPFfPvHNqeA9K9mkJRAT9HXw@mail.gmail.com

merlin
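For anyone following along, that last point is a one-line postgresql.conf change; the value below is just a starting point for a multi-drive SSD setup, not a figure from the thread:

  # number of concurrent I/Os PostgreSQL assumes the storage can service;
  # mainly benefits bitmap heap scans
  effective_io_concurrency = 256

It can be changed with a reload (or per session); no restart needed.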
On Wed, Jul 1, 2015 at 5:06 PM, Craig James <cjames@emolecules.com> wrote:
> We're buying a new server in the near future to replace an aging system. I'd
> appreciate advice on the best SSD devices and RAID controller cards
> available today.
> ....
> SSDs are great, until you need more space. I'd rather have an 8TB xlog
> partition of spinners when setting up replication and xlog archiving
> than a 500GB xlog partition. 8TB sounds like a lot until you need to
> hold on to a week's worth of xlog files on a busy server.
Good point. I'll talk to our guy who does all the barman stuff about this.
Craig A. James
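For readers sizing the same thing: the archiving side is just archive_command shipping 16 MB WAL segments somewhere, so the retention math is (WAL volume per day) x (days kept). A minimal 9.3/9.4-era sketch; the host and paths are made up, and barman's own docs describe the preferred setup:

  # postgresql.conf on the primary
  wal_level = hot_standby
  archive_mode = on
  archive_command = 'rsync -a %p barman@backuphost:/srv/barman/incoming/%f'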
Storage Review has a pretty good process and reviewed the M500DC when it released last year. http://www.storagereview.com/micron_m500dc_enterprise_ssd_review
The only database-specific info we have available is for Cassandra and MSSQL:
http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf
http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf
(some of that info might be relevant)
In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5-years. For comparison:
Micron M500DC (20nm) – 2 DWPD
Intel S3500 (20nm) – 0.3 DWPD
Intel S3510 (16nm) – 0.3 DWPD
Intel S3710 (20nm) – 10 DWPD
They’re all great drives, the question is how write-intensive is the workload.
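A quick way to place a workload on that scale is to measure how much you actually write per day and divide by the drive capacity. A rough sketch; the device name and capacities are examples:

  # data written to sdb since boot (/proc/diskstats field 10 = sectors written, 512 bytes each)
  awk '$3 == "sdb" { printf "%.1f GB written\n", $10 * 512 / 1e9 }' /proc/diskstats

  # required DWPD ~ (GB written per day) / (drive capacity in GB)
  # e.g. 200 GB/day on an 800 GB drive ~ 0.25 DWPD (fine for an S3500-class drive),
  #      2 TB/day on the same drive ~ 2.5 DWPD (you're into S3710 territory)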
Wes Vaske | Senior Storage Solutions Engineer
Micron Technology
101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728
Mobile: 515-451-7742
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Craig James
Sent: Thursday, July 02, 2015 12:20 PM
To: Wes Vaske (wvaske)
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
On Thu, Jul 2, 2015 at 7:01 AM, Wes Vaske (wvaske) <wvaske@micron.com> wrote:
What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?
Quite a few of the benefits of using a hardware RAID controller are irrelevant when using modern SSDs. The great random write performance of the drives means the cache on the controller is less useful and the drives you’re considering (Intel’s enterprise grade) will have full power protection for inflight data.
In my own testing (CentOS 7/Postgres 9.4/128GB RAM/ 8x SSDs RAID5/10/0 with mdadm vs hw controllers) I’ve found that the RAID controller is actually limiting performance compared to just using software RAID. In worst-case workloads I’m able to saturate the controller with 2 SATA drives.
Another advantage in using mdadm is that it’ll properly pass TRIM to the drive. You’ll need to test whether “discard” in your fstab will have a negative impact on performance but being able to run “fstrim” occasionally will definitely help performance in the long run.
If you want another drive to consider you should look at the Micron M500DC. Full power protection for inflight data, same NAND as Intel uses in their drives, good mixed workload performance. (I’m obviously a little biased, though ;-)
Thanks Wes. That's good advice. I've always liked mdadm and how well RAID is supported by Linux, and mostly used a controller for the cache and BBU.
I'll definitely check out your product. Can you point me to any benchmarks, both on performance and lifetime?
Craig
--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------
What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?
Quite a few of the benefits of using a hardware RAID controller are irrelevant when using modern SSDs. The great random write performance of the drives means the cache on the controller is less useful and the drives you’re considering (Intel’s enterprise grade) will have full power protection for inflight data.
For what it's worth, in my most recent iteration I decided to go with the Intel Enterprise NVMe drives and no RAID. My reasoning was thus:
1. Modern SSDs are so fast that even if you had an infinitely fast RAID card you would still be severely constrained by the limits of SAS/SATA. To get the full speed advantages you have to connect directly into the bus.
2. We don't typically have redundant electronic components in our servers. Sure, we have dual power supplies and dual NICs (though generally to handle external failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks and...no backup RAID card. Intel Enterprise SSD already have power-fail protection so I don't need a RAID card to give me BBU. Given the MTBF of good enterprise SSD I'm left to wonder if placing a RAID card in front merely adds a new point of failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at you, RAID backup battery).
3. I'm streaming to an entire redundant server and doing regular backups anyway so I'm covered for availability and recovery should the SSD (or anything else in the server) fail.
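For context, the streaming part of that is the standard 9.x setup; a minimal sketch, with hostname and user invented, plus a replication entry in pg_hba.conf and a pg_basebackup to seed the standby:

  # primary: postgresql.conf
  wal_level = hot_standby
  max_wal_senders = 3

  # standby: recovery.conf
  standby_mode = 'on'
  primary_conninfo = 'host=primary.example.com user=replicator'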
BTW, here's an article worth reading: https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/
Cheers,
Steve
On 07/06/2015 09:56 AM, Steve Crawford wrote:
> 1. Modern SSDs are so fast that even if you had an infinitely fast RAID
> card you would still be severely constrained by the limits of SAS/SATA.
> To get the full speed advantages you have to connect directly into the bus.

Correct. What we have done in the past is use smaller drives with RAID 10. This isn't for the performance but for the longevity of the drive. We obviously could do this with software RAID or hardware RAID.

> 2. We don't typically have redundant electronic components in our
> servers. [...] Given the MTBF of good enterprise SSD I'm left to wonder if
> placing a RAID card in front merely adds a new point of failure and
> scheduled-downtime-inducing hands-on maintenance (I'm looking at you,
> RAID backup battery).

That's an interesting question. It definitely adds yet another component. I can't believe how often we need to "hotfix" a RAID controller.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't control your own emotions, so everyone else should do it for you.
Completely agree with Steve.

1. Intel NVMe looks like the best bet if you have modern enough hardware for NVMe. Otherwise e.g. the S3700 mentioned elsewhere.

2. RAID controllers.

We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines. This might give people an idea about where the risk lies in the path from disk to CPU.

We've had 2 RAID card failures in the last 12 months that nuked the array with days of downtime, and 2 problems with batteries suddenly becoming useless or suddenly reporting wildly varying temperatures/overheating. There may have been other RAID problems I don't know about.

Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs.

So by my estimates:
30% annual problem rate with RAID controllers
30-50% failure rate with Seagate HDDs (Backblaze saw similar results)
0% failure rate with HGST HDDs
0% failure in our SSDs (to be fair, our one Samsung SSD apparently has a bug in TRIM under Linux, which I'll need to investigate to see if we have been affected by it)

Also, RAID controllers aren't free: not just the money but also the management of them (ever tried writing a complex install script that interacts with MegaCLI? It can be done but it's not much fun). Just take a look at the MegaCLI manual and ask yourself... is this even worth it (if you have a good MTBF on an enterprise SSD)?

RAID was meant to be about ensuring availability of data. I have trouble believing that these days....

Graeme Bell

On 06 Jul 2015, at 18:56, Steve Crawford <scrawford@pinpointresearch.com> wrote:
>
> 2. We don't typically have redundant electronic components in our servers. [...] Given the MTBF of good enterprise SSD I'm left to wonder if placing a RAID card in front merely adds a new point of failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at you, RAID backup battery).
Thanks for the info.

So if RAID controllers are not an option, what should one use to build big databases? LVM with xfs? Btrfs? ZFS?

Tigran.

----- Original Message -----
> From: "Graeme B. Bell" <graeme.bell@nibio.no>
> Sent: Tuesday, July 7, 2015 12:22:00 PM
> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>
> Completely agree with Steve.
>
> 1. Intel NVMe looks like the best bet if you have modern enough hardware for
> NVMe. Otherwise e.g. S3700 mentioned elsewhere.
> [...]
I am unsure about the performance side, but ZFS is generally very attractive to me.

Key advantages:

1) Checksumming and automatic fixing-of-broken-things on every file (not just postgres pages, but your scripts, O/S, program files).
2) Built-in lightweight compression (doesn't help with TOAST tables, in fact may slow them down, but helpful for other things). This may actually be a net negative for pg so maybe turn it off.
3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's safe to replace a RAID array with a single drive... you can use a couple of NVMe SSDs with ZFS mirror or zraid, and get the same availability you'd get from a RAID controller. Slightly better, arguably, since they claim to have fixed the RAID write-hole problem.
4) Filesystem snapshotting.

Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU with multiple NVMe drives will outperform quite a lot of the alternatives, with great data integrity guarantees.

Haven't built one yet. Hope to, later this year. Steve, I would love to know more about how you're getting on with your NVMe disk in postgres!

Graeme.

On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran <tigran.mkrtchyan@desy.de> wrote:
> Thanks for the Info.
>
> So if RAID controllers are not an option, what one should use to build
> big databases? LVM with xfs? BtrFs? Zfs?
> [...]
----- Original Message -----
> From: "Graeme B. Bell" <graeme.bell@nibio.no>
> Sent: Tuesday, July 7, 2015 12:38:10 PM
> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>
> I am unsure about the performance side but, ZFS is generally very attractive to me.
> [...]
> Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU
> with multiple NVMe drives will outperform quite a lot of the alternatives, with
> great data integrity guarantees.

We are planning to have a test setup as well. For now I have a single NVMe SSD on my test system:

# lspci | grep NVM
85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

# mount | grep nvm
/dev/nvme0n1p1 on /var/lib/pgsql/9.5 type ext4 (rw,noatime,nodiratime,data=ordered)

and I'm quite happy with it. We have a write-heavy workload on it to see when it will break. Postgres performs very well: about 2.5x faster than with regular disks with a single client, and almost linear scaling with multiple clients (picture attached; Y is the number of high-level op/s our application does, X is the number of clients). The setup has been in use for the last 3 months. Looks promising, but for production we need to have disk size twice as big as on the test system. Until today, I was planning to use a RAID10 with a HW controller...

Related to ZFS: we use ZFSonLinux and the behaviour is not as good as with Solaris. Let's re-phrase it: performance is unpredictable. We run RAIDZ2 with 30x3TB disks.

Tigran.
Lz4 compression and standard 128kb block size has shown to be materially faster here than using 8kb blocks and no compression, both with rotating disks and SSDs.

This is workload dependent in my experience, but in the applications we put Postgres to there is a very material improvement in throughput using compression and the larger block size, which is counter-intuitive and also opposite the "conventional wisdom."

For best throughput we use mirrored vdev sets.

----- Original Message -----
> From: "Graeme B. Bell" <graeme.bell@nibio.no>
> Sent: Tuesday, July 7, 2015 12:38:10 PM
> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>
> I am unsure about the performance side but, ZFS is generally very attractive to me.
> [...]
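If anyone wants to try the same settings, the dataset-level knobs amount to something like this; pool and dataset names are made up:

  # 128k records plus lz4 for the Postgres data directory
  zfs create -o recordsize=128k -o compression=lz4 -o atime=off tank/pgdata

  # see what the compression is actually buying
  zfs get compressratio tank/pgdata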
Hi Karl,

Great post, thanks.

Though I don't think it's against conventional wisdom to aggregate writes into larger blocks rather than rely on 4k performance on SSDs :-)

128kb blocks + compression certainly makes sense. But it might make less sense, I suppose, if you had some incredibly high rate of churn in your rows. But for the work we do here, we could use 16MB blocks for all the difference it would make. (Tip to others: don't do that. 128kb block performance is already enough to saturate the IO bus to most SSDs.)

Do you have your WAL log on a compressed zfs fs?

Graeme Bell

On 07 Jul 2015, at 13:28, Karl Denninger <karl@denninger.net> wrote:
> Lz4 compression and standard 128kb block size has shown to be materially faster here than using 8kb blocks and no compression, both with rotating disks and SSDs.
>
> This is workload dependent in my experience but in the applications we put Postgres to there is a very material improvement in throughput using compression and the larger blocksize, which is counter-intuitive and also opposite the "conventional wisdom."
>
> For best throughput we use mirrored vdev sets.
1. Does the Sammy NVMe have *complete* power loss protection though, for all fsync'd data?

I am very badly burned by my experiences with Crucial SSDs and their 'power loss protection' which doesn't actually ensure all fsync'd data gets into flash. It certainly looks pretty with all those capacitors on top in the photos, but we need some plug-pull tests to be sure.

2. Apologies for the typo in the previous post, raidz5 should have been raidz1.

3. Also, something to think about when you start having single-disk solutions (or non-ZFS raid, for that matter). SSDs are so unlike HDDs. The Samsung NVMe has a UBER (uncorrectable bit error rate) measured at 1 in 10^17. That's one bit gone bad in 12500 TB, a good number. Chances are the drive fails before you hit a bit error, and if not, ZFS would catch it.

Whereas current HDDs are at the 1 in 10^14 level. That means an error every 12TB, by the specs. That means, every time you fill your cheap 6-8TB Seagate drive, it likely corrupted some of your data *even if it performed according to the spec*. (That's also why RAID5 isn't viable for rebuilding large arrays, incidentally.)

Graeme Bell

On 07 Jul 2015, at 12:56, Mkrtchyan, Tigran <tigran.mkrtchyan@desy.de> wrote:
> We are planning to have a test setup as well. For now I have a single NVMe SSD on my test system:
> [...]
> Related to ZFS: we use ZFSonLinux and the behaviour is not as good as with Solaris.
> Let's re-phrase it: performance is unpredictable. We run RAIDZ2 with 30x3TB disks.
> Do you have your WAL log on a compressed zfs fs?

Yes.
Data goes on one mirrored set of vdevs, pg_xlog goes on a second, separate pool. WAL goes on a third pool on RaidZ2. WAL typically goes on rotating storage since I use it (and a basebackup) as disaster recovery (and in hot spare apps the source for the syncing hot standbys) and that's nearly a big-block-write-only data stream. Rotating media is fine for that in most applications. I take a new basebackup on reasonable intervals and rotate the WAL logs to keep that from growing without boundary.
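In other words, something shaped roughly like the following; disk and pool names are invented and the real layout depends on the chassis:

  # fast pool for the data directory: striped mirrors (RAID-10 equivalent)
  zpool create data mirror da0 da1 mirror da2 da3

  # separate small mirror for pg_xlog
  zpool create xlog mirror da4 da5

  # raidz2 pool of spinners for archived WAL and basebackups
  zpool create walarch raidz2 da6 da7 da8 da9 da10 da11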
I use LSI host adapters for the drives themselves (no hardware RAID); I'm currently running on FreeBSD 10.1. Be aware that ZFS on FreeBSD has some fairly nasty issues that I developed (and publish) a patch for; without it some workloads can result in very undesirable behavior where working set gets paged out in favor of ZFS ARC; if that happens your performance will go straight into the toilet.
Back before FreeBSD 9 when ZFS was simply not stable enough for me I used ARECA hardware RAID adapters and rotating media with BBUs and large cache memory installed on them with UFS filesystems. Hardware adapters are, however, a net lose in a ZFS environment even when they nominally work well (and they frequently interact very badly with ZFS during certain operations making them just flat-out unsuitable.) All-in I far prefer ZFS on a host adapter to UFS on a RAID adapter both from a data integrity and performance standpoint.
My SSD drives of choice are all Intel; for lower-end requirements the 730s work very well; the S3500 is next and if your write volume is high enough the S3700 has much greater endurance (but at a correspondingly higher price.) All three are properly power-fail protected. All three are much, much faster than rotating storage. If you can saturate the SATA channels and need still more I/O throughput NVMe drives are the next quantum up in performance; I'm not there with our application at the present time.
Incidentally while there are people who have questioned the 730 series power loss protection I've tested it with plug-pulls and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of them on an in-service machine here:
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 643 (4 6868)
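Checking the same attribute on another box is just a matter of grepping the SMART output; the device name is an example:

  smartctl -a /dev/ada0 | grep -i power_loss

As with any SMART attribute, the normalized value staying above its threshold (010 in the line above) means the capacitor self-test hasn't flagged a failure.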
Thanks, this is very useful to know about the 730.

When you say 'tested it with plug-pulls', you were using diskchecker.pl, right?

Graeme.

On 07 Jul 2015, at 14:39, Karl Denninger <karl@denninger.net> wrote:
>
> Incidentally while there are people who have questioned the 730 series power loss protection I've tested it with plug-pulls and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of them on an in-service machine here:
>
> 175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 643 (4 6868)
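For anyone wanting to run the same test, the usual diskchecker.pl procedure is roughly: listen on a second machine, write from the machine under test, cut the power mid-write, then verify after reboot. Host and file names below are placeholders:

  # on a second machine that stays powered:
  diskchecker.pl -l

  # on the machine under test, writing to the filesystem on the SSD:
  diskchecker.pl -s otherhost create /ssd/test_file 500

  # pull the plug during the run, power back up, then:
  diskchecker.pl -s otherhost verify /ssd/test_file

If verify reports writes that were acknowledged as synced but are missing, the drive (or something in the path) is lying about fsync.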
Storage Review has a pretty good process and reviewed the M500DC when it released last year. http://www.storagereview.com/micron_m500dc_enterprise_ssd_review
The only database-specific info we have available are for Cassandra and MSSQL:
(some of that info might be relevant)
In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5-years. For comparison:
Micron M500DC (20nm) – 2 DWPD
Intel S3500 (20nm) – 0.3 DWPD
Intel S3510 (16nm) – 0.3 DWPD
Intel S3710 (20nm) – 10 DWPD
They’re all great drives, the question is how write-intensive is the workload.
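(For a rough sense of scale, assuming a hypothetical 480 GB drive: 2 DWPD works out to 2 x 480 GB x 365 x 5 ≈ 1.75 PB of rated writes over the 5-year warranty, while 0.3 DWPD is about 260 TB -- nearly an order of magnitude apart.)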
As I have warned elsewhere,

The M500/M550 from $SOME_COMPANY is NOT SUITABLE for postgres unless you have a RAID controller with BBU to protect yourself. The M500/M550 are NOT plug-pull safe despite the 'power loss protection' claimed on the packaging. Not all fsync'd data is preserved in the event of a power loss, which completely undermines postgres's sanity.

I would be extremely skeptical about the M500DC given the name and manufacturer.

I went to quite a lot of trouble to provide $SOME_COMPANY's engineers with the full details of this fault after extensive testing (we have e.g. 20-25 of these disks) on multiple machines and controllers, at their request. Result: they stopped replying to me, and soon after I saw their PR reps talking about how 'power loss protection isn't about protecting all data during a power loss'.

The only safe way to use an M500/M550 with postgres is:

a) disable the disk cache, which will cripple performance to about 3-5% of normal.
b) use a battery-backed or cap-backed RAID controller, which will generally hurt performance by limiting you to the peak performance of the flash on the raid controller.

If you are buying such a drive, I strongly recommend buying only one and doing extensive plug-pull testing before committing to several. For myself, my time is valuable enough that it will be cheaper to buy Intel in future.

Graeme.

On 07 Jul 2015, at 15:12, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Jul 2, 2015 at 1:00 PM, Wes Vaske (wvaske) <wvaske@micron.com> wrote:
>> [endurance comparison snipped -- see Wes's post above]
>
> Intel added a new product, the 3610, that is rated for 3 DWPD. Pricing looks to be around 1.20$/GB.
>
> merlin
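(For reference, option (a) on Linux looks roughly like this -- the device name is hypothetical, and the setting generally has to be reapplied after a reboot or hotplug:)

hdparm -W /dev/sdb     # show whether the drive's volatile write cache is enabled
hdparm -W0 /dev/sdb    # disable the write cache -- option (a) above, with the performance cost noted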
The M500/M550/M600 are consumer-class drives that don't have power protection for all inflight data* (like the Samsung 8x0 series and the Intel 3x0 & 5x0 series).

The M500DC has full power protection for inflight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500 & S3700 series).

So any drive without the capacitors to protect inflight data will suffer from data loss if you're using the disk write cache and you pull the power.

*Big addendum: There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives have power loss protection against Data Corruption, and the enterprise drive has power loss protection against BOTH.

https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf

The Data Corruption problem is only an issue in non-SLC NAND, but it's industry wide. And even though some drives will protect against that, the protection of inflight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres.

Wes Vaske | Senior Storage Solutions Engineer
Micron Technology
On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote:
> The M500/M550/M600 are consumer class drives that don't have power
> protection for all inflight data.* (like the Samsung 8x0 series and
> the Intel 3x0 & 5x0 series).
>
> The M500DC has full power protection for inflight data and is an
> enterprise-class drive (like the Samsung 845DC or Intel S3500 & S3700
> series).
>
> So any drive without the capacitors to protect inflight data will
> suffer from data loss if you're using disk write cache and you pull
> the power.

Wow, I would be pretty angry if I installed a SSD in my desktop, and it loses a file that I saved just before pulling the power plug.

> *Big addendum: There are two issues on powerloss that will mess with
> Postgres. Data Loss and Data Corruption. The micron consumer drives
> will have power loss protection against Data Corruption and the
> enterprise drive will have power loss protection against BOTH.
>
> https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf
>
> The Data Corruption problem is only an issue in non-SLC NAND but
> it's industry wide. And even though some drives will protect against
> that, the protection of inflight data that's been fsync'd is more
> important and should disqualify *any* consumer drives from *any*
> company from consideration for use with Postgres.

So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.

- Heikki
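(A crude sketch of that ordering question, with hypothetical paths -- diskchecker.pl does a far more thorough job of the same idea:)

dd if=/dev/urandom of=/mnt/test/A bs=8k count=1 conv=fsync   # write A and fsync it
dd if=/dev/urandom of=/mnt/test/B bs=8k count=1 conv=fsync   # then write B and fsync it
# pull the plug somewhere around here; after reboot, if B survived but A did not,
# the drive persisted "flushed" writes out of order and recovery logic can't be trusted.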
Hi Wes,

1. The first interesting thing is that prior to my mentioning this problem to C_____ a year or two back, the power loss protection was advertised everywhere as simply that, without qualifiers about 'not inflight data'. Check out the marketing of the M500 for the first year or so and try to find an example where they say 'but inflight data isn't protected!'.

2. The second (and more important) interesting thing is that this is irrelevant! Fsync'd data is BY DEFINITION not data in flight. Fsync means "this data is secure on the disk!" However, the drives corrupt it.

Postgres's sanity depends on a reliable fsync. That's why we see posts on the performance list saying 'fsync=no makes your postgres faster, but really, don't do it in production'. We are talking about internal DB corruption, not just a crash and a few lost transactions.

These drives return from fsync while data is still in volatile cache. That's breaking the spec, and it's why they are not OK for postgres by themselves. This is not about 'in-flight' data, it's about fsync'd WAL data.

Graeme.
Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks...

diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10 - 30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough.

G

On 07 Jul 2015, at 16:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.
>
> - Heikki
Hi.
How would a BBU cache help you if the drive lies about fsync? I assume any RAID controller drops data from its BBU cache once the drive has reported it fsynced; as far as I know, there is no other "magic command" for the drive to tell the controller that the data is now safe and can be removed from the BBU cache.
On Tue, Jul 7, 2015 at 10:58 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
>> On 07 Jul 2015, at 16:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.
>
> Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks...
>
> diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10 - 30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough.

100% agree with your sentiments. I do believe that there are other enterprise SSD vendors that offer reliable parts, but not at the price point Intel does for the cheaper drives. The consumer-grade vendors are simply not trustworthy unless proven otherwise (I had my own unpleasant experience with OCZ, for example). Intel played the same game with their early parts but have since become a model of how to ship drives to the market.

RAID controllers are completely unnecessary for SSDs as they currently exist. Software RAID is superior in every way; the hardware features of RAID controllers -- BBU, write caching, and write consolidation -- are redundant to what the SSDs themselves do (being, internally, basically RAID 0). A hypothetical SSD-optimized RAID controller is possible; it could do things like balance wear and optimize writes across multiple physical drives. This would require deep participation between the drive and the controller, and FWICT no such thing exists, excepting super-expensive SANs, which I don't recommend anyway.

merlin
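(If anyone wants the software-RAID version of that spelled out on Linux, a minimal md sketch -- device names are hypothetical:)

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde   # RAID 10 across four SSDs
mkfs.xfs /dev/md0        # any filesystem; no controller cache or BBU in the picture
cat /proc/mdstat         # array state and resync progress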
> > RAID controllers are completely unnecessary for SSD as they currently
> > exist.

Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho.

In my own situation, I had a tight budget, high performance demand and a newish machine with RAID controller and HDDs in it as a starting point. So it was more a question of 'what can you do with a free raid controller and not much money' back in 2013. And it has worked very well. Still, I had hoped for a bit more from the cheaper SSDs though; I'd hoped to use fastpath on the controller and bypass the cache.

The way NVMe prices are going though, I wouldn't do it again if I was doing it this year. I'd just go direct to NVMe and trash the raid controller. These Sammy and Intel NVMes are basically enterprise hardware at consumer prices. Heck, I'll probably put one in my next gaming PC.

Re: software raid.

I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at ZFS and thinking 'why the heck am I even using software raid?'

G
That is a very good question, which I have raised elsewhere on the postgresql lists previously.

In practice: I have *never* managed to make diskchecker fail with the BBU enabled in front of the drives, and I spent days trying with plug pulls, till I reached the point where as a statistical event it just can't be that likely at all. That's not to say it can't ever happen, just that I've taken all reasonable measures that I can to find out on the time and money budget I had available.

In theory: It may be the fact that the BBU makes the drives run at about half speed, so that the capacitors go a good bit further to empty the cache; after all, without the BBU in the way, the drive manages to save everything but the last fragment of writes. But I also suspect that the controller itself may be replaying the last set of writes from around the time of power loss.

Anyway, I'm 50/50 on those two explanations. Any other thoughts welcome.

This raises another interesting question. Does anyone here have a document explaining how their BBU cache works EXACTLY (at the cache / SATA level) on their server? Because I haven't been able to find one for mine (Dell PERC H710/H710P). Can anyone tell me with godlike authority and precision what exactly happens inside that BBU post-power-failure?

There is rather too much magic involved for me to be happy.

G

On 07 Jul 2015, at 18:27, Vitalii Tymchyshyn <vit@tym.im> wrote:
> Hi.
>
> How would BBU cache help you if it lies about fsync? I suppose any RAID controller removes data from BBU cache after it was fsynced by the drive. As I know, there is no other "magic command" for drive to tell controller that the data is safe now and can be removed from BBU cache.
Hi Graeme,

Why would you think that you don't need RAID for ZFS?

Reason I'm asking is because we are moving to ZFS on FreeBSD for our future projects.

Regards,
Wei Shan

On 8 July 2015 at 00:46, Graeme B. Bell <graeme.bell@nibio.no> wrote:
> > > RAID controllers are completely unnecessary for SSD as they currently
> > > exist.
>
> Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho.
> [...]

--
Regards,
Ang Wei Shan
> This raises another interesting question. Does anyone here have a document explaining how their BBU cache works EXACTLY (at the cache / SATA level) on their server? Because I haven't been able to find one for mine (Dell PERC H710/H710P). Can anyone tell me with godlike authority and precision what exactly happens inside that BBU post-power-failure?

(And if you have that manual -- how can you know it's accurate? That the implementation matches the manual and is free of bugs? Because my M500s didn't match the packaging, and neither did an H710 we bought -- Dell had advertised features in some marketing material that were only present on the H710P.)

And I see UBER (unrecoverable bit error) rates for SSDs and HDDs, but has anyone ever seen them for the flash-based cache on their RAID controller?

Sleep well, friends.

Graeme.
root@Dbms2:/var/tmp # ./diskchecker.pl -s newfs verify /test/biteme
verifying: 0.00%
verifying: 3.81%
verifying: 10.91%
verifying: 18.71%
verifying: 26.46%
verifying: 33.95%
verifying: 41.20%
verifying: 49.48%
verifying: 57.23%
verifying: 64.89%
verifying: 72.54%
verifying: 80.04%
verifying: 87.96%
verifying: 95.15%
verifying: 100.00%
Total errors: 0
da6 at mps0 bus 0 scbus0 target 17 lun 0
da6: <ATA INTEL SSDSC2BP24 0420> Fixed Direct Access SPC-4 SCSI device
da6: Serial Number BTJR446401KW240AGN
da6: 600.000MB/s transfers
da6: Command Queueing enabled
da6: 228936MB (468862128 512 byte sectors: 255H 63S/T 29185C)
# smartctl -a /dev/da6
=== START OF INFORMATION SECTION ===
Model Family: Intel 730 and DC S3500/S3700 Series SSDs
Device Model: INTEL SSDSC2BP240G4
Serial Number: BTJR446401KW240AGN
LU WWN Device Id: 5 5cd2e4 04b71afc7
Firmware Version: L2010420
User Capacity: 240,057,409,536 bytes [240 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jul 7 17:01:36 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Note -- same firmware between all three series of Intel devices...... :-)
Yes, I like these SSDs -- they don't lie and they don't lose data on a power-pull.
> Why would you think that you don't need RAID for ZFS?
>
> Reason I'm asking is because we are moving to ZFS on FreeBSD for our future projects.

Because you have zraid. :-)

https://blogs.oracle.com/bonwick/entry/raid_z

General points:

1. It's my understanding that ZFS is designed to talk to the hardware directly, and so it would be bad to hide the physical layer from ZFS unless you had to. After all, I don't think they implemented a raid-like system inside ZFS just for the fun of it.

2. You have zraid built in and easy to manage within ZFS -- and well tested, compared to NewRaidController (TM) -- so why add another layer of management to your disk storage?

3. You reintroduce the RAID write hole.

4. There might be some argument for hardware raid on an existing system, but with software raid (the point I was addressing) it makes little sense at all.

5. If you're on hardware raid and your controller dies, you're screwed in several ways. It's harder to get a new raid controller than a new PC. Your chances of recovery are lower than with ZFS. IMHO it's more scary to recover from a failed raid controller, too.

6. Recovery is faster if the disks aren't full, e.g. ZFS recovers only the data that is there. This might not seem a big deal, but chances are it would save you 50% of your downtime in a crisis.

However, I think with Linux you might want to use RAID for the boot disk; I don't know if Linux can boot from ZFS yet. I would (and am) using FreeBSD with ZFS.

Graeme.
The comment on HDDs is true and gave me another thought.

These new 'shingled' HDDs (the 8TB ones) rely on rewriting all the data on tracks that overlap your data, any time you change the data. Result: disks 8-20x slower during writes, after they fill up.

Do they have power loss protection for the data being rewritten during reshingling? You could have data committed at position X and accidentally nuke data at position Y.

[I know that using a shingled disk sounds crazy (it sounds crazy to me), but you can bet there are people who just want to max out the disk bays in their server...]

Graeme.

On 07 Jul 2015, at 19:28, Michael Nolan <htfoot@gmail.com> wrote:
> On Tue, Jul 7, 2015 at 10:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Wow, I would be pretty angry if I installed a SSD in my desktop, and it loses a file that I saved just before pulling the power plug.
>
> That can (and does) happen with spinning disks, too.
>
>> So it lies about fsync()... [...]
>
> The sad fact is that MANY drives (ssd as well as spinning) lie about their fsync status.
>
> --
> Mike Nolan
On Tue, Jul 7, 2015 at 11:43 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
> The comment on HDDs is true and gave me another thought.
>
> These new 'shingled' HDDs (the 8TB ones) rely on rewriting all the data on tracks that overlap your data, any time you change the data. Result: disks 8-20x slower during writes, after they fill up.
>
> Do they have power loss protection for the data being rewritten during reshingling? You could have data committed at position X and accidentally nuke data at position Y.
>
> [I know that using a shingled disk sounds crazy (it sounds crazy to me), but you can bet there are people who just want to max out the disk bays in their server...]

Let's just say no online backup companies are using those disks. :)

Biggest current production spinners being used I know of are 4TB, non-shingled.
On 07 Jul 2015, at 19:47, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> [I know that using a shingled disk sounds crazy (it sounds crazy to me), but you can bet there are people who just want to max out the disk bays in their server...]
>
> Let's just say no online backup companies are using those disks. :)

I'm not so sure. Literally the most famous online backup company is (or was planning to):

https://www.backblaze.com/blog/6-tb-hard-drive-face-off/

But I think that a massive read-only archive really is the only use for these things. I hope they go out of fashion, soon.

But I was thinking more of the 'small company postgres server' or 'charitable organisation postgres server'. Someone is going to make this mistake, you can bet. Probably not someone on THIS list, of course...

> Biggest current production spinners being used I know of are 4TB,
> non-shingled.

I think we may have some 6TB WD reds around here. I'll need to look around.

G
Regarding:
“lie about their fsync status.”
This is mostly semantics but it might help google searches on the issue.
A drive doesn't support fsync(); that's a filesystem/kernel process. A drive will do a FLUSH CACHE. Before kernels 2.6.<low numbers> the fsync() call wouldn't send any ATA or SCSI command to flush the disk cache, whereas, AFAICT, modern kernels and filesystem versions *will* do this. When 'sync' is called, the filesystem will issue the appropriate command to the disk to flush the write cache.
For ATA, this is “FLUSH CACHE” (E7h). To check support for the command use:
[root@postgres ~]# smartctl --identify /dev/sdu | grep "FLUSH CACHE"
83 13 1 FLUSH CACHE EXT supported
83 12 1 FLUSH CACHE supported
86 13 1 FLUSH CACHE EXT supported
86 12 1 FLUSH CACHE supported
The 1s in the 3rd column represent SUPPORTED for the feature listed in the last column.
Cheers,
Wes Vaske
On Tue, Jul 7, 2015 at 11:46 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
> Re: software raid.
>
> I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at ZFS and thinking 'why the heck am I even using software raid?'

Good point. At least for me, I've yet to jump on the ZFS bandwagon and so don't have an opinion on it.

merlin
On 07/07/2015 09:01 PM, Wes Vaske (wvaske) wrote:
> Regarding:
> "lie about their fsync status."
>
> This is mostly semantics but it might help google searches on the issue.
>
> A drive doesn't support fsync(); that's a filesystem/kernel process. A drive will do a FLUSH CACHE. Before kernels 2.6.<low numbers> the fsync() call wouldn't send any ATA or SCSI command to flush the disk cache, whereas, AFAICT, modern kernels and filesystem versions *will* do this. When 'sync' is called, the filesystem will issue the appropriate command to the disk to flush the write cache.
>
> For ATA, this is "FLUSH CACHE" (E7h).

Right, to be precise, the problem isn't that the drive lies about fsync(). It lies about FLUSH CACHE instead. Search & replace fsync() with FLUSH CACHE, and the same question remains: when the drive breaks its promise wrt. FLUSH CACHE, does it nevertheless guarantee that the order the data is eventually flushed to disk is consistent with the order in which the data and FLUSH CACHE were sent to the drive? That's an important distinction, because it makes the difference between "the most recent data the application saved might be lost even though the FLUSH CACHE command returned" and "your filesystem is corrupt".

- Heikki
Cache flushing isn't an atomic operation though. Even if the ordering is right, you are likely to have a partial fsync on the disk when the lights go out -- isn't your FS still corrupt?

On 07 Jul 2015, at 21:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Right, to be precise, the problem isn't that the drive lies about fsync(). It lies about FLUSH CACHE instead. Search & replace fsync() with FLUSH CACHE, and the same question remains: when the drive breaks its promise wrt. FLUSH CACHE, does it nevertheless guarantee that the order the data is eventually flushed to disk is consistent with the order in which the data and FLUSH CACHE were sent to the drive? That's an important distinction, because it makes the difference between "the most recent data the application saved might be lost even though the FLUSH CACHE command returned" and "your filesystem is corrupt".
On 07/07/2015 10:59 PM, Graeme B. Bell wrote:
> Cache flushing isn't an atomic operation though. Even if the ordering
> is right, you are likely to have a partial fsync on the disk when the
> lights go out -- isn't your FS still corrupt?

If the filesystem is worth its salt, no. Journaling filesystems, for example, rely on the journal to work around that problem, and there are other mechanisms. PostgreSQL has exactly the same problem and uses the WAL to solve it.

- Heikki
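(For the curious, the Postgres-side settings tied to this discussion can be checked like so -- nothing exotic, just the stock GUCs:)

psql -c 'SHOW fsync'               # must stay on, or none of the above guarantees apply
psql -c 'SHOW full_page_writes'    # full-page images in the WAL protect against torn/partial page writes
psql -c 'SHOW wal_sync_method'     # how WAL flushes reach the disk (fdatasync, fsync, open_datasync, ...)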