Thread: New server: SSD/RAID recommendations?

New server: SSD/RAID recommendations?

From
Craig James
Date:
We're buying a new server in the near future to replace an aging system. I'd appreciate advice on the best SSD devices and RAID controller cards available today.

The database is about 750 GB. This is a "warehouse" server. We load supplier catalogs throughout a typical work week, then on the weekend (after Q/A), integrate the new supplier catalogs into our customer-visible "store", which is then copied to a production server where customers see it. So the load is mostly data loading, and essentially no OLTP. Typically there are fewer than a dozen connections to Postgres.

Linux 2.6.32
Postgres 9.3
Hardware:
  2 x INTEL WESTMERE 4C XEON 2.40GHZ
  12GB DDR3 ECC 1333MHz
  3WARE 9650SE-12ML with BBU
  12 x 1TB Hitachi 7200RPM SATA disks
RAID 1 (2 disks)
   Linux partition
   Swap partition
   pg_xlog partition
RAID 10 (8 disks)
   Postgres database partition

We get 5000-7000 TPS from pgbench on this system.

The new system will have at least as many CPUs, and probably a lot more memory (196 GB). The database hasn't reached 1TB yet, but we'd like room to grow, so we'd like a 2TB file system for Postgres. We'll start with the latest versions of Linux and Postgres.

Intel's products have always received good reports in this forum. Is that still the best recommendation? Or are there good alternatives that are price competitive?

What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

Are spinning disks still a good choice for the pg_xlog partition and OS? Is there any reason to get spinning disks at all, or is it better/simpler to just put everything on SSD drives?

Thanks in advance for your advice!

Craig

Re: New server: SSD/RAID recommendations?

From
Andreas Joseph Krogh
Date:
On Thursday, July 2, 2015 at 01:06:57, Craig James <cjames@emolecules.com> wrote:
....
Depends on your SSD drives, but today's enterprise-grade SSDs can handle pg_xlog just fine. So I'd go full SSD, unless you have many BLOBs in pg_largeobject, in which case move those to a separate tablespace on "archive-grade" (spinning) disks.
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 

Re: New server: SSD/RAID recommendations?

From
Scott Marlowe
Date:
On Wed, Jul 1, 2015 at 5:06 PM, Craig James <cjames@emolecules.com> wrote:
> We're buying a new server in the near future to replace an aging system. I'd
> appreciate advice on the best SSD devices and RAID controller cards
> available today.
>
> The database is about 750 GB. This is a "warehouse" server. We load supplier
> catalogs throughout a typical work week, then on the weekend (after Q/A),
> integrate the new supplier catalogs into our customer-visible "store", which
> is then copied to a production server where customers see it. So the load is
> mostly data loading, and essentially no OLTP. Typically there are fewer than
> a dozen connections to Postgres.
>
> Linux 2.6.32

Upgrade to an OS with a later kernel, 3.11 at the lowest. 2.6.32 is
broken from an IO perspective. It writes 2 to 4x more data than needed
for normal operation.
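A quick sanity check along those lines, as a sketch (the version comparison relies on `sort -V` from GNU coreutils):

```shell
# Report whether the running kernel meets a minimum version.
# kernel_at_least MIN CURRENT -> success (exit 0) if CURRENT >= MIN
kernel_at_least() {
    # sort -V orders version strings; if MIN sorts first (or is equal),
    # CURRENT is new enough
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

if kernel_at_least "3.11" "$(uname -r)"; then
    echo "kernel OK"
else
    echo "kernel too old for good IO behavior"
fi
```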

> Postgres 9.3
> Hardware:
>   2 x INTEL WESTMERE 4C XEON 2.40GHZ
>   12GB DDR3 ECC 1333MHz
>   3WARE 9650SE-12ML with BBU
>   12 x 1TB Hitachi 7200RPM SATA disks
> RAID 1 (2 disks)
>    Linux partition
>    Swap partition
>    pg_xlog partition
> RAID 10 (8 disks)
>    Postgres database partition
>
> We get 5000-7000 TPS from pgbench on this system.
>
> The new system will have at least as many CPUs, and probably a lot more
> memory (196 GB). The database hasn't reached 1TB yet, but we'd like room to
> grow, so we'd like a 2TB file system for Postgres. We'll start with the
> latest versions of Linux and Postgres.

Once your db is bigger than memory, the size of the memory isn't as
important as the speed of the IO. Being able to read and write huge
swathes of data becomes more important than memory size at that point.
Being able to read 100MB/s versus being able to read 1,000MB/s is the
difference between 10 minute queries and 10 hour queries on a
reporting box. For sequential throughput, i.e. loading and retrieving
with only one or two clients connected, you can throw more and more
spinners at it. If you're gonna have enough clients connected to make
the array go from sequential to random access, then you want to try
and put SSDs in there if possible, but the cost / Gig is much higher
than spinners.

ZFS can use SSDs as cache, as can some newer RAID controllers, which
represents a compromise between the two.

If you go with spinners, with or without ssd cache, throw as many at
the problem as you can. And run them in RAID-10 if you possibly can.
RAID-5 or 6 are much slower, especially on spinners.
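If you do go software RAID, the setup itself is short; here's a hedged sketch (the /dev/sdX names and the 4-disk layout are hypothetical, and building the command as a string first lets you review it before running it as root):

```shell
# Build a RAID-10 creation command for review (device names are examples;
# substitute your own disks).
build_raid10_cmd() {
    md_dev="$1"; shift
    # after the shift, $# is the number of member disks and $* lists them
    echo "mdadm --create $md_dev --level=10 --raid-devices=$# $*"
}

# Print the invocation; inspect it, then run it as root and put a
# filesystem on the resulting /dev/md0.
build_raid10_cmd /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
```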

> What about a RAID controller? Are RAID controllers even available for
> PCI-Express SSD drives, or do we have to stick with SATA if we need a
> battery-backed RAID controller? Or is software RAID sufficient for SSD
> drives?

Not that I know of. PCI-E drives act as their own drive. You could
software-RAID them, I guess. Or do you mean are there PCI-E controllers
for SATA SSD drives? Plenty of those.

Many modern controllers don't use battery-backed cache; they've gone
to flash memory, which requires no battery to survive power-down. I
like LSI, 3Ware and Areca RAID HBAs.


> Are spinning disks still a good choice for the pg_xlog partition and OS? Is
> there any reason to get spinning disks at all, or is it better/simpler to
> just put everything on SSD drives?

Spinning drives are fine for xlog and OS. If you're logging to the
same drive set as pg_xlog is using, you will hit the wall faster.

SSDs are great, until you need more space. I'd rather have an 8TB xlog
partition of spinners when setting up replication and xlog archiving
than a 500GB xlog partition. 8TB sounds like a lot until you need to
hold on to a week's worth of xlog files on a busy server.
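To put rough numbers on that, a back-of-the-envelope sizing sketch (the 10 MB/s sustained WAL rate is a made-up figure; measure your own):

```shell
# GB of WAL generated over N days at a sustained rate of M MB/s
# (integer arithmetic, truncated toward zero)
wal_retention_gb() {
    rate_mb_s="$1"; days="$2"
    echo $(( rate_mb_s * days * 24 * 3600 / 1024 ))
}

wal_retention_gb 10 7   # one week at 10 MB/s -> 5906 GB, i.e. ~6 TB
```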


Re: New server: SSD/RAID recommendations?

From
"Wes Vaske (wvaske)"
Date:

What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

 

Quite a few of the benefits of using a hardware RAID controller are irrelevant when using modern SSDs. The great random write performance of the drives means the cache on the controller is less useful and the drives you’re considering (Intel’s enterprise grade) will have full power protection for inflight data.

 

In my own testing (CentOS 7/Postgres 9.4/128GB RAM/ 8x SSDs RAID5/10/0 with mdadm vs hw controllers) I’ve found that the RAID controller is actually limiting performance compared to just using software RAID. In worst-case workloads I’m able to saturate the controller with 2 SATA drives.

 

Another advantage in using mdadm is that it’ll properly pass TRIM to the drive. You’ll need to test whether “discard” in your fstab will have a negative impact on performance but being able to run “fstrim” occasionally will definitely help performance in the long run.
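For illustration, the two approaches look roughly like this (the mount point and the weekly schedule are assumptions for the example, not from the thread):

```shell
# Option 1 - /etc/fstab: continuous TRIM via the "discard" mount option
#            (benchmark first; it hurts performance on some drives)
/dev/md0  /var/lib/pgsql  ext4  noatime,discard  0 2

# Option 2 - periodic TRIM instead: leave "discard" out of fstab and run
#            fstrim on a schedule, e.g. a weekly cron script:
# /etc/cron.weekly/fstrim-pgdata
#   #!/bin/sh
#   fstrim /var/lib/pgsql
```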

 

If you want another drive to consider you should look at the Micron M500DC. Full power protection for inflight data, same NAND as Intel uses in their drives, good mixed workload performance. (I’m obviously a little biased, though ;-)

 

Wes Vaske | Senior Storage Solutions Engineer

Micron Technology

101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728

 


Re: New server: SSD/RAID recommendations?

From
Merlin Moncure
Date:
On Wed, Jul 1, 2015 at 6:06 PM, Craig James <cjames@emolecules.com> wrote:
> We're buying a new server in the near future to replace an aging system. I'd
> appreciate advice on the best SSD devices and RAID controller cards
> available today.
> ....
> Intel's products have always received good reports in this forum. Is that
> still the best recommendation? Or are there good alternatives that are price
> competitive?

In my opinion, the Intel S3500 is still an incredible value: sub-$1/GB
and extremely fast. I use it heavily both on production systems I
manage and on my personal workstation. This report:
http://lkcl.net/reports/ssd_analysis.html told me everything I needed
to know about the drive. If you are sustaining extremely high write
rates, though, particularly random writes, you need to factor in drive
lifespan and may want to consider the S3700 or one of its competitors.
Both drives have been refreshed into the 3510 and 3710 models, but
those are brand new and not widely reviewed, so tread carefully. On my
crapbox workstation I get about 5k random writes at a large scale
factor from a single device.

I definitely support software RAID over picking up a fancy RAID
controller, as long as you know your way around mdadm. Oh, and be
sure to crank effective_io_concurrency:
http://www.postgresql.org/message-id/CAHyXU0wgpE2E3B+rmZ959tJT_adPFfPvHNqeA9K9mkJRAT9HXw@mail.gmail.com
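For reference, on Postgres 9.4 or later that setting can be changed without hand-editing postgresql.conf; the value below is only an illustrative starting point for an SSD array, not a recommendation from this thread (benchmark your own workload):

```sql
-- Hint the executor that the storage can service many concurrent
-- reads, as SSD arrays can; reload applies it without a restart.
ALTER SYSTEM SET effective_io_concurrency = 256;
SELECT pg_reload_conf();
```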

merlin


Re: New server: SSD/RAID recommendations?

From
Craig James
Date:
On Thu, Jul 2, 2015 at 7:01 AM, Wes Vaske (wvaske) <wvaske@micron.com> wrote:

What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

 

Quite a few of the benefits of using a hardware RAID controller are irrelevant when using modern SSDs. The great random write performance of the drives means the cache on the controller is less useful and the drives you’re considering (Intel’s enterprise grade) will have full power protection for inflight data.

 

In my own testing (CentOS 7/Postgres 9.4/128GB RAM/ 8x SSDs RAID5/10/0 with mdadm vs hw controllers) I’ve found that the RAID controller is actually limiting performance compared to just using software RAID. In worst-case workloads I’m able to saturate the controller with 2 SATA drives.

 

Another advantage in using mdadm is that it’ll properly pass TRIM to the drive. You’ll need to test whether “discard” in your fstab will have a negative impact on performance but being able to run “fstrim” occasionally will definitely help performance in the long run.

 

If you want another drive to consider you should look at the Micron M500DC. Full power protection for inflight data, same NAND as Intel uses in their drives, good mixed workload performance. (I’m obviously a little biased, though ;-)


Thanks Wes. That's good advice. I've always liked mdadm and how well RAID is supported by Linux, and mostly used a controller for the cache and BBU.

I'll definitely check out your product. Can you point me to any benchmarks, both on performance and lifetime?

Craig

 



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Re: New server: SSD/RAID recommendations?

From
Craig James
Date:
On Wed, Jul 1, 2015 at 4:56 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
On Thursday, July 2, 2015 at 01:06:57, Craig James <cjames@emolecules.com> wrote:
....
Depends on your SSD drives, but today's enterprise-grade SSDs can handle pg_xlog just fine. So I'd go full SSD, unless you have many BLOBs in pg_largeobject, in which case move those to a separate tablespace on "archive-grade" (spinning) disks.

No blobs in our database, so that sounds like good advice. It simplifies the hardware a lot if we can go with just SSDs.

Craig
 
 



--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Re: New server: SSD/RAID recommendations?

From
Craig James
Date:
On Wed, Jul 1, 2015 at 4:57 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Wed, Jul 1, 2015 at 5:06 PM, Craig James <cjames@emolecules.com> wrote:
> We're buying a new server in the near future to replace an aging system. I'd
> appreciate advice on the best SSD devices and RAID controller cards
> available today.
> ....
SSDs are great, until you need more space. I'd rather have an 8TB xlog
partition of spinners when setting up replication and xlog archiving
than a 500GB xlog partition. 8TB sounds like a lot until you need to
hold on to a week's worth of xlog files on a busy server.

Good point. I'll talk to our guy who does all the barman stuff about this.

Craig

--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Re: New server: SSD/RAID recommendations?

From
"Wes Vaske (wvaske)"
Date:

Storage Review has a pretty good process and reviewed the M500DC when it was released last year. http://www.storagereview.com/micron_m500dc_enterprise_ssd_review

 

The only database-specific info we have available is for Cassandra and MSSQL:

http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf

http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf

 

(some of that info might be relevant)

 

In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5-years. For comparison:

Micron M500DC (20nm) – 2 DWPD

Intel S3500 (20nm) – 0.3 DWPD

Intel S3510 (16nm) – 0.3 DWPD

Intel S3710 (20nm) – 10 DWPD

 

They’re all great drives; the question is how write-intensive the workload is.
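A DWPD rating converts directly into a daily write budget (capacity × DWPD); a small sketch with illustrative capacities:

```shell
# Daily write budget in GB for a drive: capacity_gb * dwpd
# (awk handles the fractional DWPD values)
write_budget_gb() {
    awk -v cap="$1" -v dwpd="$2" 'BEGIN { printf "%g\n", cap * dwpd }'
}

write_budget_gb 800 0.3   # S3500-class, 800 GB at 0.3 DWPD -> 240 GB/day
write_budget_gb 480 2     # M500DC-class, 480 GB at 2 DWPD  -> 960 GB/day
```

Compare that budget against your measured daily write volume (including write amplification from checkpoints and WAL) to pick a tier.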

 

Wes Vaske | Senior Storage Solutions Engineer

Micron Technology

101 West Louis Henna Blvd, Suite 210 | Austin, TX 78728

Mobile: 515-451-7742

 


Re: New server: SSD/RAID recommendations?

From
Steve Crawford
Date:
On 07/02/2015 07:01 AM, Wes Vaske (wvaske) wrote:

What about a RAID controller? Are RAID controllers even available for PCI-Express SSD drives, or do we have to stick with SATA if we need a battery-backed RAID controller? Or is software RAID sufficient for SSD drives?

 

Quite a few of the benefits of using a hardware RAID controller are irrelevant when using modern SSDs. The great random write performance of the drives means the cache on the controller is less useful and the drives you’re considering (Intel’s enterprise grade) will have full power protection for inflight data.


For what it's worth, in my most recent iteration I decided to go with the Intel Enterprise NVMe drives and no RAID. My reasoning was thus:

1. Modern SSDs are so fast that even if you had an infinitely fast RAID card you would still be severely constrained by the limits of SAS/SATA. To get the full speed advantages you have to connect directly into the bus.

2. We don't typically have redundant electronic components in our servers. Sure, we have dual power supplies and dual NICs (though generally to handle external failures) and ECC RAM, but no hot-backup CPU, no redundant RAM banks, and... no backup RAID card. Intel enterprise SSDs already have power-fail protection, so I don't need a RAID card to give me a BBU. Given the MTBF of good enterprise SSDs, I'm left to wonder whether placing a RAID card in front merely adds a new point of failure plus scheduled-downtime-inducing hands-on maintenance (I'm looking at you, RAID backup battery).

3. I'm streaming to an entire redundant server and doing regular backups anyway so I'm covered for availability and recovery should the SSD (or anything else in the server) fail.

BTW, here's an article worth reading: https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

Cheers,
Steve

Re: New server: SSD/RAID recommendations?

From
"Joshua D. Drake"
Date:
On 07/06/2015 09:56 AM, Steve Crawford wrote:
> On 07/02/2015 07:01 AM, Wes Vaske (wvaske) wrote:

> For what it's worth, in my most recent iteration I decided to go with
> the Intel Enterprise NVMe drives and no RAID. My reasoning was thus:
>
> 1. Modern SSDs are so fast that even if you had an infinitely fast RAID
> card you would still be severely constrained by the limits of SAS/SATA.
> To get the full speed advantages you have to connect directly into the bus.

Correct. What we have done in the past is use smaller drives with RAID
10. This isn't for performance but for the longevity of the drives.
We obviously could do this with software RAID or hardware RAID.

>
> 2. We don't typically have redundant electronic components in our
> servers. Sure, we have dual power supplies and dual NICs (though
> generally to handle external failures) and ECC-RAM but no hot-backup CPU
> or redundant RAM banks and...no backup RAID card. Intel Enterprise SSD
> already have power-fail protection so I don't need a RAID card to give
> me BBU. Given the MTBF of good enterprise SSD I'm left to wonder if
> placing a RAID card in front merely adds a new point of failure and
> scheduled-downtime-inducing hands-on maintenance (I'm looking at you,
> RAID backup battery).

That's an interesting question. It definitely adds yet another
component. I can't believe how often we need to "hotfix" a RAID controller.

JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
Completely agree with Steve.

1. Intel NVMe looks like the best bet if you have modern enough hardware for NVMe. Otherwise e.g. S3700 mentioned
elsewhere.

2. RAID controllers.

We have e.g. 10-12 RAID controllers here and e.g. 25-30 SSDs, among various machines.
This might give people an idea about where the risk lies in the path from disk to CPU.

We've had 2 RAID card failures in the last 12 months that nuked the array with days of downtime, and 2 problems with
batteries suddenly becoming useless or suddenly reporting wildly varying temperatures/overheating. There may have been
other RAID problems I don't know about.

Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I guess they have 100-200 disks?). We also
have about 25-30 Hitachi/HGST HDDs.

So by my estimates:
30% annual problem rate with RAID controllers
30-50% failure rate with Seagate HDDs (Backblaze saw similar results)
0% failure rate with HGST HDDs.
0% failure in our SSDs. (To be fair, our one Samsung SSD apparently has a bug in TRIM under Linux, which I'll need to
investigate to see whether we have been affected.)

Also, RAID controllers aren't free - not just the money but also the management of them (ever tried writing a complex
install script that interacts with MegaCLI? It can be done, but it's not much fun). Just take a look at the MegaCLI
manual and ask yourself... is this even worth it (if you have a good MTBF on an enterprise SSD)?

RAID was meant to be about ensuring availability of data. I have trouble believing that these days....

Graeme Bell


On 06 Jul 2015, at 18:56, Steve Crawford <scrawford@pinpointresearch.com> wrote:

>
> 2. We don't typically have redundant electronic components in our servers. Sure, we have dual power supplies and dual
> NICs (though generally to handle external failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks and...no
> backup RAID card. Intel Enterprise SSD already have power-fail protection so I don't need a RAID card to give me BBU.
> Given the MTBF of good enterprise SSD I'm left to wonder if placing a RAID card in front merely adds a new point of
> failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at you, RAID backup battery).



Re: New server: SSD/RAID recommendations?

From
"Mkrtchyan, Tigran"
Date:
Thanks for the Info.

So if RAID controllers are not an option, what should one use to build
big databases? LVM with XFS? Btrfs? ZFS?

Tigran.

----- Original Message -----
> From: "Graeme B. Bell" <graeme.bell@nibio.no>
> To: "Steve Crawford" <scrawford@pinpointresearch.com>
> Cc: "Wes Vaske (wvaske)" <wvaske@micron.com>, "pgsql-performance" <pgsql-performance@postgresql.org>
> Sent: Tuesday, July 7, 2015 12:22:00 PM
> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

> Completely agree with Steve.
>
> 1. Intel NVMe looks like the best bet if you have modern enough hardware for
> NVMe. Otherwise e.g. S3700 mentioned elsewhere.
>
> 2. RAID controllers.
>
> We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines.
> This might give people idea about where the risk lies in the path from disk to
> CPU.
>
> We've had 2 RAID card failures in the last 12 months that nuked the array with
> days of downtime, and 2 problems with batteries suddenly becoming useless or
> suddenly reporting wildly varying temperatures/overheating. There may have been
> other RAID problems I don't know about.
>
> Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I
> guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs.
>
> So by my estimates:
> 30% annual problem rate with RAID controllers
> 30-50% failure rate with Seagate HDDs (backblaze saw similar results)
> 0% failure rate with HGST HDDs.
> 0% failure in our SSDs.   (to be fair, our one samsung SSD apparently has a bug
> in TRIM under linux, which I'll need to investigate to see if we have been
> affected by).
>
> also, RAID controllers aren't free - not just the money but also the management
> of them (ever tried writing a complex install script that interacts work with
> MegaCLI? It can be done but it's not much fun.). Just take a look at the
> MegaCLI manual and ask yourself... is this even worth it (if you have a good
> MTBF on an enterprise SSD).
>
> RAID was meant to be about ensuring availability of data. I have trouble
> believing that these days....
>
> Graeme Bell
>
>
> On 06 Jul 2015, at 18:56, Steve Crawford <scrawford@pinpointresearch.com> wrote:
>
>>
>> 2. We don't typically have redundant electronic components in our servers. Sure,
>> we have dual power supplies and dual NICs (though generally to handle external
>> failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks and...no
>> backup RAID card. Intel Enterprise SSD already have power-fail protection so I
>> don't need a RAID card to give me BBU. Given the MTBF of good enterprise SSD
>> I'm left to wonder if placing a RAID card in front merely adds a new point of
>> failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at
>> you, RAID backup battery).
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
I am unsure about the performance side, but ZFS is generally very attractive to me.

Key advantages:

1) Checksumming and automatic fixing-of-broken-things on every file (not just postgres pages, but your scripts, O/S,
program files).
2) Built-in lightweight compression (doesn't help with TOAST tables, in fact may slow them down, but helpful for other
things). This may actually be a net negative for pg so maybe turn it off.
3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's safe to replace a RAID array with a
single drive... you can use a couple of NVMe SSDs with ZFS mirror or zraid, and get the same availability you'd get
from a RAID controller. Slightly better, arguably, since they claim to have fixed the raid write-hole problem.
4) filesystem snapshotting

Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU with multiple NVMe drives will outperform
quite a lot of the alternatives, with great data integrity guarantees.
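As a rough sketch of what points 3 and 4 could look like in practice (the pool, dataset, and device names here are hypothetical, the commands need root, and this is not a tested recipe):

```shell
# Mirror two NVMe drives into one pool (ZFS's answer to RAID1).
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1

# A dataset for the Postgres data directory, with lightweight compression.
zfs create -o compression=lz4 -o atime=off tank/pgdata

# Point 4: snapshot the filesystem, e.g. before a risky upgrade.
zfs snapshot tank/pgdata@pre-upgrade
```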

Haven't built one yet. Hope to, later this year. Steve, I would love to know more about how you're getting on with your
NVMe disk in postgres!

Graeme.

On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran <tigran.mkrtchyan@desy.de> wrote:

> Thanks for the Info.
>
> So if RAID controllers are not an option, what one should use to build
> big databases? LVM with xfs? BtrFs? Zfs?
>
> Tigran.
>
> ----- Original Message -----
>> From: "Graeme B. Bell" <graeme.bell@nibio.no>
>> To: "Steve Crawford" <scrawford@pinpointresearch.com>
>> Cc: "Wes Vaske (wvaske)" <wvaske@micron.com>, "pgsql-performance" <pgsql-performance@postgresql.org>
>> Sent: Tuesday, July 7, 2015 12:22:00 PM
>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>
>> Completely agree with Steve.
>>
>> 1. Intel NVMe looks like the best bet if you have modern enough hardware for
>> NVMe. Otherwise e.g. S3700 mentioned elsewhere.
>>
>> 2. RAID controllers.
>>
>> We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines.
>> This might give people idea about where the risk lies in the path from disk to
>> CPU.
>>
>> We've had 2 RAID card failures in the last 12 months that nuked the array with
>> days of downtime, and 2 problems with batteries suddenly becoming useless or
>> suddenly reporting wildly varying temperatures/overheating. There may have been
>> other RAID problems I don't know about.
>>
>> Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I
>> guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs.
>>
>> So by my estimates:
>> 30% annual problem rate with RAID controllers
>> 30-50% failure rate with Seagate HDDs (backblaze saw similar results)
>> 0% failure rate with HGST HDDs.
>> 0% failure in our SSDs.   (to be fair, our one samsung SSD apparently has a bug
>> in TRIM under linux, which I'll need to investigate to see if we have been
>> affected by).
>>
>> also, RAID controllers aren't free - not just the money but also the management
>> of them (ever tried writing a complex install script that interacts work with
>> MegaCLI? It can be done but it's not much fun.). Just take a look at the
>> MegaCLI manual and ask yourself... is this even worth it (if you have a good
>> MTBF on an enterprise SSD).
>>
>> RAID was meant to be about ensuring availability of data. I have trouble
>> believing that these days....
>>
>> Graeme Bell
>>
>>
>> On 06 Jul 2015, at 18:56, Steve Crawford <scrawford@pinpointresearch.com> wrote:
>>
>>>
>>> 2. We don't typically have redundant electronic components in our servers. Sure,
>>> we have dual power supplies and dual NICs (though generally to handle external
>>> failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks and...no
>>> backup RAID card. Intel Enterprise SSD already have power-fail protection so I
>>> don't need a RAID card to give me BBU. Given the MTBF of good enterprise SSD
>>> I'm left to wonder if placing a RAID card in front merely adds a new point of
>>> failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at
>>> you, RAID backup battery).
>>
>>
>>



Re: New server: SSD/RAID recommendations?

From
"Mkrtchyan, Tigran"
Date:

----- Original Message -----
> From: "Graeme B. Bell" <graeme.bell@nibio.no>
> To: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
> Cc: "Graeme B. Bell" <graeme.bell@nibio.no>, "Steve Crawford" <scrawford@pinpointresearch.com>, "Wes Vaske (wvaske)"
> <wvaske@micron.com>, "pgsql-performance" <pgsql-performance@postgresql.org>
> Sent: Tuesday, July 7, 2015 12:38:10 PM
> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

> I am unsure about the performance side but, ZFS is generally very attractive to
> me.
>
> Key advantages:
>
> 1) Checksumming and automatic fixing-of-broken-things on every file (not just
> postgres pages, but your scripts, O/S, program files).
> 2) Built-in  lightweight compression (doesn't help with TOAST tables, in fact
> may slow them down, but helpful for other things). This may actually be a net
> negative for pg so maybe turn it off.
> 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's
> safe to replace a RAID array with a single drive... you can use a couple of
> NVMe SSDs with ZFS mirror or zraid, and  get the same availability you'd get
> from a RAID controller. Slightly better, arguably, since they claim to have
> fixed the raid write-hole problem.
> 4) filesystem snapshotting
>
> Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU
> with multiple NVMe drives will outperform quite a lot of the alternatives, with
> great data integrity guarantees.


We are planning to have a test setup as well. For now I have a single NVMe SSD on my
test system:

# lspci | grep NVM
85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

# mount | grep nvm
/dev/nvme0n1p1 on /var/lib/pgsql/9.5 type ext4 (rw,noatime,nodiratime,data=ordered)


and we're quite happy with it. We run a write-heavy workload on it to see when it will
break. Postgres performs very well: about 2.5x faster than with regular disks with a
single client, and almost linear scaling with multiple clients (picture attached;
Y axis is the number of high-level ops/s our application does, X axis the number of
clients). The setup has been in use for the last 3 months. Looks promising, but for
production we need disks twice as big as on the test system. Until today, I was
planning to use RAID10 with a HW controller...

Related to ZFS: we use ZFSonLinux and the behaviour is not as good as with Solaris.
Let me re-phrase that: performance is unpredictable. We run RAIDZ2 with 30x3TB disks.

Tigran.

>
> Haven't built one yet. Hope to, later this year. Steve, I would love to know
> more about how you're getting on with your NVMe disk in postgres!
>
> Graeme.
>
> On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran <tigran.mkrtchyan@desy.de> wrote:
>
>> Thanks for the Info.
>>
>> So if RAID controllers are not an option, what one should use to build
>> big databases? LVM with xfs? BtrFs? Zfs?
>>
>> Tigran.
>>
>> ----- Original Message -----
>>> From: "Graeme B. Bell" <graeme.bell@nibio.no>
>>> To: "Steve Crawford" <scrawford@pinpointresearch.com>
>>> Cc: "Wes Vaske (wvaske)" <wvaske@micron.com>, "pgsql-performance"
>>> <pgsql-performance@postgresql.org>
>>> Sent: Tuesday, July 7, 2015 12:22:00 PM
>>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>>
>>> Completely agree with Steve.
>>>
>>> 1. Intel NVMe looks like the best bet if you have modern enough hardware for
>>> NVMe. Otherwise e.g. S3700 mentioned elsewhere.
>>>
>>> 2. RAID controllers.
>>>
>>> We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines.
>>> This might give people idea about where the risk lies in the path from disk to
>>> CPU.
>>>
>>> We've had 2 RAID card failures in the last 12 months that nuked the array with
>>> days of downtime, and 2 problems with batteries suddenly becoming useless or
>>> suddenly reporting wildly varying temperatures/overheating. There may have been
>>> other RAID problems I don't know about.
>>>
>>> Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I
>>> guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs.
>>>
>>> So by my estimates:
>>> 30% annual problem rate with RAID controllers
>>> 30-50% failure rate with Seagate HDDs (backblaze saw similar results)
>>> 0% failure rate with HGST HDDs.
>>> 0% failure in our SSDs.   (to be fair, our one samsung SSD apparently has a bug
>>> in TRIM under linux, which I'll need to investigate to see if we have been
>>> affected by).
>>>
>>> also, RAID controllers aren't free - not just the money but also the management
>>> of them (ever tried writing a complex install script that interacts work with
>>> MegaCLI? It can be done but it's not much fun.). Just take a look at the
>>> MegaCLI manual and ask yourself... is this even worth it (if you have a good
>>> MTBF on an enterprise SSD).
>>>
>>> RAID was meant to be about ensuring availability of data. I have trouble
>>> believing that these days....
>>>
>>> Graeme Bell
>>>
>>>
>>> On 06 Jul 2015, at 18:56, Steve Crawford <scrawford@pinpointresearch.com> wrote:
>>>
>>>>
>>>> 2. We don't typically have redundant electronic components in our servers. Sure,
>>>> we have dual power supplies and dual NICs (though generally to handle external
>>>> failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks and...no
>>>> backup RAID card. Intel Enterprise SSD already have power-fail protection so I
>>>> don't need a RAID card to give me BBU. Given the MTBF of good enterprise SSD
>>>> I'm left to wonder if placing a RAID card in front merely adds a new point of
>>>> failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at
>>>> you, RAID backup battery).
>>>
>>>
>>>
>
>
>

Attachment

Re: New server: SSD/RAID recommendations?

From
Karl Denninger
Date:
On 7/7/2015 05:56, Mkrtchyan, Tigran wrote:

----- Original Message -----
From: "Graeme B. Bell" <graeme.bell@nibio.no>
To: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Cc: "Graeme B. Bell" <graeme.bell@nibio.no>, "Steve Crawford" <scrawford@pinpointresearch.com>, "Wes Vaske (wvaske)"
<wvaske@micron.com>, "pgsql-performance" <pgsql-performance@postgresql.org>
Sent: Tuesday, July 7, 2015 12:38:10 PM
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
I am unsure about the performance side but, ZFS is generally very attractive to
me.

Key advantages:

1) Checksumming and automatic fixing-of-broken-things on every file (not just
postgres pages, but your scripts, O/S, program files).
2) Built-in  lightweight compression (doesn't help with TOAST tables, in fact
may slow them down, but helpful for other things). This may actually be a net
negative for pg so maybe turn it off.
3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's
safe to replace a RAID array with a single drive... you can use a couple of
NVMe SSDs with ZFS mirror or zraid, and  get the same availability you'd get
from a RAID controller. Slightly better, arguably, since they claim to have
fixed the raid write-hole problem.
4) filesystem snapshotting

Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU
with multiple NVMe drives will outperform quite a lot of the alternatives, with
great data integrity guarantees.
LZ4 compression and the standard 128kb block size have proven materially faster here than 8kb blocks with no compression, both with rotating disks and SSDs.

This is workload dependent in my experience but in the applications we put Postgres to there is a very material improvement in throughput using compression and the larger blocksize, which is counter-intuitive and also opposite the "conventional wisdom."

For best throughput we use mirrored vdev sets.
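For reference, the properties Karl is describing map to these ZFS settings (the dataset name is a placeholder, and 128k is already the ZFS default recordsize):

```shell
# Per-dataset settings; behaviour is workload dependent, so benchmark both ways.
zfs set recordsize=128k tank/pgdata   # vs. recordsize=8k to match Postgres pages
zfs set compression=lz4 tank/pgdata

# See how much compression is actually achieving on your data.
zfs get compressratio tank/pgdata
```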

--
Karl Denninger
karl@denninger.net
The Market Ticker
[S/MIME encrypted email preferred]
Attachment

Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
Hi Karl,

Great post, thanks.

Though I don't think it's against conventional wisdom to aggregate writes into larger blocks rather than rely on 4k
performance on SSDs :-)

128kb blocks + compression certainly makes sense. But it might make less sense, I suppose, if you had some incredibly
high rate of churn in your rows.
But for the work we do here, we could use 16MB blocks for all the difference it would make. (Tip to others: don't do
that. 128kb block performance is already enough to max out the IO bus to most SSDs.)

Do you have your WAL log on a compressed zfs fs?

Graeme Bell


On 07 Jul 2015, at 13:28, Karl Denninger <karl@denninger.net> wrote:

> Lz4 compression and standard 128kb block size has shown to be materially faster here than using 8kb blocks and no
compression,both with rotating disks and SSDs. 
>
> This is workload dependent in my experience but in the applications we put Postgres to there is a very material
improvementin throughput using compression and the larger blocksize, which is counter-intuitive and also opposite the
"conventionalwisdom." 
>
> For best throughput we use mirrored vdev sets.



Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
1. Does the Samsung NVMe have *complete* power-loss protection, though, for all fsync'd data?
I have been very badly burned by my experiences with Crucial SSDs and their 'power loss protection', which doesn't
actually ensure all fsync'd data gets into flash.
It certainly looks pretty with all those capacitors on top in the photos, but we need some plug-pull tests to be sure.

2. Apologies for the typo in the previous post, raidz5 should have been raidz1.

3. Also, something to think about when you start having single disk solutions (or non-ZFS raid, for that matter).

SSDs are so unlike HDDs.

The Samsung NVMe has a UBER (uncorrectable bit error rate) measured at 1 in 10^17. That's one bit gone bad in 12500 TB,
a good number. Chances are the drive fails before you hit a bit error, and if not, ZFS would catch it.

Whereas current HDDs are at the 1 in 10^14 level. That means an error every 12TB, by the specs. That means every time
you fill your cheap 6-8TB Seagate drive, it likely corrupted some of your data *even if it performed according to the
spec*. (That's also why RAID5 isn't viable for rebuilding large arrays, incidentally.)
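The arithmetic behind those figures checks out if the rates are read as one error per 10^17 (or 10^14) bits:

```shell
# Bits per uncorrectable error, converted to terabytes (integer arithmetic).
echo "SSD (1 in 10^17): $((10**17 / 8 / 10**12)) TB per expected bit error"
echo "HDD (1 in 10^14): $((10**14 / 8 / 10**12)) TB per expected bit error"
```

which gives 12500 TB for the SSD and 12 TB (12.5, rounded down) for the HDD.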

Graeme Bell


On 07 Jul 2015, at 12:56, Mkrtchyan, Tigran <tigran.mkrtchyan@desy.de> wrote:

>
>
> ----- Original Message -----
>> From: "Graeme B. Bell" <graeme.bell@nibio.no>
>> To: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
>> Cc: "Graeme B. Bell" <graeme.bell@nibio.no>, "Steve Crawford" <scrawford@pinpointresearch.com>, "Wes Vaske (wvaske)"
>> <wvaske@micron.com>, "pgsql-performance" <pgsql-performance@postgresql.org>
>> Sent: Tuesday, July 7, 2015 12:38:10 PM
>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>
>> I am unsure about the performance side but, ZFS is generally very attractive to
>> me.
>>
>> Key advantages:
>>
>> 1) Checksumming and automatic fixing-of-broken-things on every file (not just
>> postgres pages, but your scripts, O/S, program files).
>> 2) Built-in  lightweight compression (doesn't help with TOAST tables, in fact
>> may slow them down, but helpful for other things). This may actually be a net
>> negative for pg so maybe turn it off.
>> 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's
>> safe to replace a RAID array with a single drive... you can use a couple of
>> NVMe SSDs with ZFS mirror or zraid, and  get the same availability you'd get
>> from a RAID controller. Slightly better, arguably, since they claim to have
>> fixed the raid write-hole problem.
>> 4) filesystem snapshotting
>>
>> Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU
>> with multiple NVMe drives will outperform quite a lot of the alternatives, with
>> great data integrity guarantees.
>
>
> We are planing to have a test setup as well. For now I have single NVMe SSD on my
> test system:
>
> # lspci | grep NVM
> 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)
>
> # mount | grep nvm
> /dev/nvme0n1p1 on /var/lib/pgsql/9.5 type ext4 (rw,noatime,nodiratime,data=ordered)
>
>
> and quite happy with it. We have write heavy workload on it to see when it will
> break. Postgres Performs very well. About x2.5 faster than with regular disks
> with a single client and almost linear with multiple clients (picture attached.
> On Y number of high level op/s our application does, X number of clients). The
> setup is used last 3 months. Looks promising but for production we need to
> to have disk size twice as big as on the test system. Until today, I was
> planning to use a RAID10 with a HW controller...
>
> Related to ZFS. We use ZFSonlinux and behaviour is not as good as with solaris.
> Let's re-phrase it: performance is unpredictable. We run READZ2 with 30x3TB disks.
>
> Tigran.
>
>>
>> Haven't built one yet. Hope to, later this year. Steve, I would love to know
>> more about how you're getting on with your NVMe disk in postgres!
>>
>> Graeme.
>>
>> On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran <tigran.mkrtchyan@desy.de> wrote:
>>
>>> Thanks for the Info.
>>>
>>> So if RAID controllers are not an option, what one should use to build
>>> big databases? LVM with xfs? BtrFs? Zfs?
>>>
>>> Tigran.
>>>
>>> ----- Original Message -----
>>>> From: "Graeme B. Bell" <graeme.bell@nibio.no>
>>>> To: "Steve Crawford" <scrawford@pinpointresearch.com>
>>>> Cc: "Wes Vaske (wvaske)" <wvaske@micron.com>, "pgsql-performance"
>>>> <pgsql-performance@postgresql.org>
>>>> Sent: Tuesday, July 7, 2015 12:22:00 PM
>>>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>>>
>>>> Completely agree with Steve.
>>>>
>>>> 1. Intel NVMe looks like the best bet if you have modern enough hardware for
>>>> NVMe. Otherwise e.g. S3700 mentioned elsewhere.
>>>>
>>>> 2. RAID controllers.
>>>>
>>>> We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines.
>>>> This might give people idea about where the risk lies in the path from disk to
>>>> CPU.
>>>>
>>>> We've had 2 RAID card failures in the last 12 months that nuked the array with
>>>> days of downtime, and 2 problems with batteries suddenly becoming useless or
>>>> suddenly reporting wildly varying temperatures/overheating. There may have been
>>>> other RAID problems I don't know about.
>>>>
>>>> Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I
>>>> guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs.
>>>>
>>>> So by my estimates:
>>>> 30% annual problem rate with RAID controllers
>>>> 30-50% failure rate with Seagate HDDs (backblaze saw similar results)
>>>> 0% failure rate with HGST HDDs.
>>>> 0% failure in our SSDs.   (to be fair, our one samsung SSD apparently has a bug
>>>> in TRIM under linux, which I'll need to investigate to see if we have been
>>>> affected by).
>>>>
>>>> also, RAID controllers aren't free - not just the money but also the management
>>>> of them (ever tried writing a complex install script that interacts work with
>>>> MegaCLI? It can be done but it's not much fun.). Just take a look at the
>>>> MegaCLI manual and ask yourself... is this even worth it (if you have a good
>>>> MTBF on an enterprise SSD).
>>>>
>>>> RAID was meant to be about ensuring availability of data. I have trouble
>>>> believing that these days....
>>>>
>>>> Graeme Bell
>>>>
>>>>
>>>> On 06 Jul 2015, at 18:56, Steve Crawford <scrawford@pinpointresearch.com> wrote:
>>>>
>>>>>
>>>>> 2. We don't typically have redundant electronic components in our servers. Sure,
>>>>> we have dual power supplies and dual NICs (though generally to handle external
>>>>> failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks and...no
>>>>> backup RAID card. Intel Enterprise SSD already have power-fail protection so I
>>>>> don't need a RAID card to give me BBU. Given the MTBF of good enterprise SSD
>>>>> I'm left to wonder if placing a RAID card in front merely adds a new point of
>>>>> failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at
>>>>> you, RAID backup battery).
>>>>
>>>>
>>>>
>>
>>
>>
> <pg-with-ssd.png>



Re: New server: SSD/RAID recommendations?

From
Karl Denninger
Date:
On 7/7/2015 06:52, Graeme B. Bell wrote:
Hi Karl,

Great post, thanks. 

Though I don't think it's against conventional wisdom to aggregate writes into larger blocks rather than rely on 4k performance on ssds :-) 

128kb blocks + compression certainly makes sense. But it might make less sense I suppose if you had some incredibly high rate of churn in your rows. 
But for the work we do here, we could use 16MB blocks for all the difference it would make. (Tip to others: don't do that. 128kb block performance is already enough out the IO bus to most ssds)

Do you have your WAL log on a compressed zfs fs? 

Graeme Bell
Yes.

Data goes on one mirrored set of vdevs, pg_xlog goes on a second, separate pool, and archived WAL goes on a third pool on RaidZ2.  Archived WAL typically goes on rotating storage since I use it (and a basebackup) as disaster recovery (and, in hot-spare setups, as the source for syncing hot standbys), and it's a nearly big-block-write-only data stream.  Rotating media is fine for that in most applications.  I take a new basebackup at reasonable intervals and rotate the WAL logs to keep the archive from growing without bound.

I use LSI host adapters for the drives themselves (no hardware RAID); I'm currently running on FreeBSD 10.1.  Be aware that ZFS on FreeBSD has some fairly nasty issues that I developed (and publish) a patch for; without it some workloads can result in very undesirable behavior where working set gets paged out in favor of ZFS ARC; if that happens your performance will go straight into the toilet.

Back before FreeBSD 9 when ZFS was simply not stable enough for me I used ARECA hardware RAID adapters and rotating media with BBUs and large cache memory installed on them with UFS filesystems.  Hardware adapters are, however, a net lose in a ZFS environment even when they nominally work well (and they frequently interact very badly with ZFS during certain operations making them just flat-out unsuitable.)  All-in I far prefer ZFS on a host adapter to UFS on a RAID adapter both from a data integrity and performance standpoint.

My SSD drives of choice are all Intel; for lower-end requirements the 730s work very well; the S3500 is next and if your write volume is high enough the S3700 has much greater endurance (but at a correspondingly higher price.)  All three are properly power-fail protected.  All three are much, much faster than rotating storage.  If you can saturate the SATA channels and need still more I/O throughput NVMe drives are the next quantum up in performance; I'm not there with our application at the present time.

Incidentally while there are people who have questioned the 730 series power loss protection I've tested it with plug-pulls and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of them on an in-service machine here:

175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       643 (4 6868)


--
Karl Denninger
karl@denninger.net
The Market Ticker
[S/MIME encrypted email preferred]
Attachment

Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
Thanks, this is very useful to know about the 730. When you say 'tested it with plug-pulls', you were using
diskchecker.pl, right?

Graeme.

On 07 Jul 2015, at 14:39, Karl Denninger <karl@denninger.net> wrote:

>
> Incidentally while there are people who have questioned the 730 series power loss protection I've tested it with
> plug-pulls and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of
> them on an in-service machine here:
>
> 175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       643 (4 6868)



Re: New server: SSD/RAID recommendations?

From
Merlin Moncure
Date:
On Thu, Jul 2, 2015 at 1:00 PM, Wes Vaske (wvaske) <wvaske@micron.com> wrote:

Storage Review has a pretty good process and reviewed the M500DC when it released last year. http://www.storagereview.com/micron_m500dc_enterprise_ssd_review

 

The only database-specific info we have available are for Cassandra and MSSQL:

http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf

http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf

 

(some of that info might be relevant)

 

In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5-years. For comparison:

Micron M500DC (20nm) – 2 DWPD

Intel S3500 (20nm) – 0.3 DWPD

Intel S3510 (16nm) – 0.3 DWPD

Intel S3710 (20nm) – 10 DWPD

 

They’re all great drives, the question is how write-intensive is the workload.
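To turn a DWPD rating into a total-bytes-written budget: capacity x DWPD x days in the rating period. A sketch with a hypothetical 800 GB drive over a 5-year term:

```shell
# TB written over a 5-year rating period for an 800 GB drive at various DWPD ratings.
CAP_GB=800
for DWPD in 2 10; do
  echo "${DWPD} DWPD: $((CAP_GB * DWPD * 365 * 5 / 1000)) TB"
done
```

i.e. 2920 TB at 2 DWPD and 14600 TB at 10 DWPD.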



Intel added a new product, the 3610, that is rated for 3 DWPD.  Pricing looks to be around $1.20/GB.

merlin 

Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
As I have warned elsewhere,

The M500/M550 from $SOME_COMPANY are NOT SUITABLE for postgres unless you have a RAID controller with BBU to protect
yourself.
The M500/M550 are NOT plug-pull safe despite the 'power loss protection' claimed on the packaging. Not all fsync'd data
is preserved in the event of a power loss, which completely undermines postgres's sanity.

I would be extremely skeptical about the M500DC given the name and manufacturer.

I went to quite a lot of trouble to provide $SOME_COMPANY's engineers with the full details of this fault after extensive testing (we have e.g. 20-25 of these disks) on multiple machines and controllers, at their request. Result: they stopped replying to me, and soon after I saw their PR reps talking about how 'power loss protection isn't about protecting all data during a power loss'.

The only safe way to use an M500/M550 with postgres is:

a) disable the disk cache, which will cripple performance to about 3-5% of normal.
b) use a battery backed or cap-backed RAID controller, which will generally hurt performance, by limiting you to the peak performance of the flash on the raid controller.

If you are buying such a drive, I strongly recommend buying only one and doing extensive plug-pull testing before committing to several.
For myself, my time is valuable enough that it will be cheaper to buy intel in future.

Graeme.




Re: New server: SSD/RAID recommendations?

From
"Wes Vaske (wvaske)"
Date:
The M500/M550/M600 are consumer-class drives that don't have power protection for all inflight data.* (like the Samsung 8x0 series and the Intel 3x0 & 5x0 series).

The M500DC has full power protection for inflight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500 & S3700 series).

So any drive without the capacitors to protect inflight data will suffer from data loss if you're using disk write cache and you pull the power.

*Big addendum:
There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives will have power loss protection against Data Corruption and the enterprise drive will have power loss protection against BOTH.

https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf

The Data Corruption problem is only an issue in non-SLC NAND, but it's industry-wide. And even though some drives will protect against that, the protection of inflight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres.

Wes Vaske | Senior Storage Solutions Engineer
Micron Technology




Re: New server: SSD/RAID recommendations?

From
Heikki Linnakangas
Date:
On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote:
> The M500/M550/M600 are consumer class drives that don't have power
> protection for all inflight data.* (like the Samsung 8x0 series and
> the Intel 3x0 & 5x0 series).
>
> The M500DC has full power protection for inflight data and is an
> enterprise-class drive (like the Samsung 845DC or Intel S3500 & S3700
> series).
>
> So any drive without the capacitors to protect inflight data will
> suffer from data loss if you're using disk write cache and you pull
> the power.

Wow, I would be pretty angry if I installed a SSD in my desktop, and it
loses a file that I saved just before pulling the power plug.

> *Big addendum: There are two issues on powerloss that will mess with
> Postgres. Data Loss and Data Corruption. The micron consumer drives
> will have power loss protection against Data Corruption and the
> enterprise drive will have power loss protection against BOTH.
>
> https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf
>
>  The Data Corruption problem is only an issue in non-SLC NAND but
> it's industry wide. And even though some drives will protect against
> that, the protection of inflight data that's been fsync'd is more
> important and should disqualify *any* consumer drives from *any*
> company from consideration for use with Postgres.

So it lies about fsync()... The next question is, does it nevertheless
enforce the correct ordering of persisting fsync'd data? If you write to
file A and fsync it, then write to another file B and fsync it too, is
it guaranteed that if B is persisted, A is as well? Because if it isn't,
you can end up with filesystem (or database) corruption anyway.

- Heikki
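Heikki's ordering question can be made concrete with a small sketch (illustrative file names; this shows only what fsync is supposed to guarantee, not what a lying drive actually does):

```python
import os

def durable_write(path, data):
    """Write data and block until the kernel reports it on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # on honest hardware, data is persistent once this returns
    finally:
        os.close(fd)

# fsync acts as an ordering barrier: A must be durable before B is even written.
# A drive that acknowledges fsync while data sits in volatile cache can
# persist B but lose A on power loss -- exactly the corruption scenario above.
durable_write("file_A.dat", b"WAL record")
durable_write("file_B.dat", b"heap page that assumes the WAL record exists")
```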



Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
Hi Wes

1. The first interesting thing is that prior to my mentioning this problem to C_____ a year or two back, the power loss protection was advertised everywhere as simply that, without qualifiers about 'not inflight data'. Check out the marketing of the M500 for the first year or so and try to find an example where they say 'but inflight data isn't protected!'.

2. The second (and more important) interesting thing is that this is irrelevant!

Fsync'd data is BY DEFINITION not data in flight.
Fsync means "This data is secure on the disk!"
However, the drives corrupt it.

Postgres's sanity depends on a reliable fsync. That's why we see posts on the performance list saying 'fsync=no makes your postgres faster but really, don't do it in production'.
We are talking about internal DB corruption, not just a crash and a few lost transactions.

These drives return from fsync while data is still in volatile cache.
That's breaking the spec, and it's why they are not OK for postgres by themselves.

This is not about 'in-flight' data, it's about fsync'd wal log data.

Graeme.





Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks...

diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10-30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough.

G


On 07 Jul 2015, at 16:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

>
> So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.
>
> - Heikki



Re: New server: SSD/RAID recommendations?

From
Vitalii Tymchyshyn
Date:

Hi.

How would BBU cache help you if it lies about fsync? I suppose any RAID controller removes data from BBU cache after it was fsynced by the drive. As I know, there is no other "magic command" for drive to tell controller that the data is safe now and can be removed from BBU cache.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: New server: SSD/RAID recommendations?

From
Merlin Moncure
Date:
> On 07 Jul 2015, at 16:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>>
>> So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.

On Tue, Jul 7, 2015 at 10:58 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
>
> Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks...
>
> diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10-30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough.

100% agree with your sentiments.   I do believe that there are other
enterprise SSD vendors that offer reliable parts but not at the price
point intel does for the cheaper drives.  The consumer grade vendors
are simply not trustworthy unless proven otherwise (I had my own
unpleasant experience with OCZ for example).  Intel played the same
game with their early parts but have since become a model of how to
ship drives to the market.

RAID controllers are completely unnecessary for SSD as they currently
exist.  Software raid is superior in every way; the hardware features
of raid controllers, BBU, write caching, and write consolidation are
redundant to what the SSD themselves do (being themselves RAID 0
basically).  A hypothetical SSD optimized raid controller is possible;
it could do things like balance wear and optimize writes across
multiple physical drives.  This would require deep participation
between the drive and the controller and FWICT no such thing exists excepting super-expensive SANs, which I don't recommend anyway.

merlin


Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
>
> RAID controllers are completely unnecessary for SSD as they currently
> exist.

Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho.

In my own situation, I had a tight budget, high performance demand and a newish machine with RAID controller and HDDs in it as a starting point.
So it was more a question of 'what can you do with a free raid controller and not much money' back in 2013. And it has worked very well.
Still, I had hoped for a bit more from the cheaper SSDs though; I'd hoped to use FastPath on the controller and bypass the cache.

The way NVMe prices are going though, I wouldn't do it again if I was doing it this year. I'd just go direct to NVMe and trash the raid controller. These Sammy and Intel NVMes are basically enterprise hardware at consumer prices. Heck, I'll probably put one in my next gaming PC.

Re: software raid.

I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at ZFS and thinking 'why the heck am I even using software raid?'

G



Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
That is a very good question, which I have raised elsewhere on the postgresql lists previously.

In practice: I have *never* managed to make diskchecker fail with the BBU enabled in front of the drives, and I spent days trying with plug pulls till I reached the point where as a statistical event it just can't be that likely at all. That's not to say it can't ever happen, just that I've taken all reasonable measures that I can to find out on the time and money budget I had available.
 

In theory: It may be the fact the BBU makes the drives run at about half speed, so that the capacitors go a good bit further to empty the cache; after all, without the BBU in the way, the drive manages to save everything but the last fragment of writes. But I also suspect that the controller itself may be replaying the last set of writes from around the time of power loss.
 

Anyway I'm 50/50 on those two explanations. Any other thoughts welcome. 

This raises another interesting question. Does anyone here have a document explaining how their BBU cache works EXACTLY (at cache / SATA level) on their server? Because I haven't been able to find any for mine (Dell PERC H710/H710P). Can anyone tell me with godlike authority and precision, what exactly happens inside that BBU post-power failure?
 

There is rather too much magic involved for me to be happy.

G

On 07 Jul 2015, at 18:27, Vitalii Tymchyshyn <vit@tym.im> wrote:

> Hi.
> 
> How would BBU cache help you if it lies about fsync? I suppose any RAID controller removes data from BBU cache after it was fsynced by the drive. As I know, there is no other "magic command" for drive to tell controller that the data is safe now and can be removed from BBU cache.
 


Re: New server: SSD/RAID recommendations?

From
Wei Shan
Date:
Hi Graeme,

Why would you think that you don't need RAID for ZFS?

Reason I'm asking is because we are moving to ZFS on FreeBSD for our future projects.

Regards,
Wei Shan




--
Regards,
Ang Wei Shan

Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
> 
> This raises another interesting question. Does anyone here have a document explaining how their BBU cache works EXACTLY (at cache / SATA level) on their server? Because I haven't been able to find any for mine (Dell PERC H710/H710P). Can anyone tell me with godlike authority and precision, what exactly happens inside that BBU post-power failure?


(and if you have that manual - how can you know it's accurate? that the implementation matches the manual and is free of bugs? because my M500s didn't match the packaging and neither did a H710 we bought - Dell had advertised features in some marketing material that were only present on the H710P)
 

And I see UBER (unrecoverable bit error) rates for SSDs and HDDs, but has anyone ever seen them for the flash-based cache on their raid controller?
 

Sleep well, friends.

Graeme. 



Re: New server: SSD/RAID recommendations?

From
Karl Denninger
Date:
After a plug-pull during the create, reboot and here is the verify:

root@Dbms2:/var/tmp # ./diskchecker.pl -s newfs verify /test/biteme
 verifying: 0.00%
 verifying: 3.81%
 verifying: 10.91%
 verifying: 18.71%
 verifying: 26.46%
 verifying: 33.95%
 verifying: 41.20%
 verifying: 49.48%
 verifying: 57.23%
 verifying: 64.89%
 verifying: 72.54%
 verifying: 80.04%
 verifying: 87.96%
 verifying: 95.15%
 verifying: 100.00%
Total errors: 0

da6 at mps0 bus 0 scbus0 target 17 lun 0
da6: <ATA INTEL SSDSC2BP24 0420> Fixed Direct Access SPC-4 SCSI device
da6: Serial Number BTJR446401KW240AGN 
da6: 600.000MB/s transfers
da6: Command Queueing enabled
da6: 228936MB (468862128 512 byte sectors: 255H 63S/T 29185C)

# smartctl -a /dev/da6

=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S3500/S3700 Series SSDs
Device Model:     INTEL SSDSC2BP240G4
Serial Number:    BTJR446401KW240AGN
LU WWN Device Id: 5 5cd2e4 04b71afc7
Firmware Version: L2010420
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jul  7 17:01:36 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Note -- same firmware between all three series of Intel devices... :-)

Yes, I like these SSDs -- they don't lie and they don't lose data on a power-pull.
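For ongoing monitoring, that Power_Loss_Cap_Test row can be pulled out of `smartctl -a` output and checked programmatically; a minimal sketch assuming smartctl's usual attribute-table column order (ID, name, flag, value, worst, threshold, ...):

```python
def parse_smart_attr(line):
    """Split one row of smartctl's SMART attribute table into named fields."""
    f = line.split()
    return {"id": int(f[0]), "name": f[1],
            "value": int(f[3]), "worst": int(f[4]), "thresh": int(f[5])}

row = ("175 Power_Loss_Cap_Test     0x0033   100   100   010    "
       "Pre-fail  Always       -       643 (4 6868)")
attr = parse_smart_attr(row)
# Alarm if the normalized value has decayed toward the pre-fail threshold.
print(f"{attr['name']}: value={attr['value']} threshold={attr['thresh']}")
assert attr["value"] > attr["thresh"]
```

Wiring this to the live output of `smartctl -a /dev/da6` in a cron job gives early warning if the capacitor self-test starts degrading.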


On 7/7/2015 08:08, Graeme B. Bell wrote:
Thanks, this is very useful to know about the 730. When you say 'tested it with plug-pulls', you were using diskchecker.pl, right?

Graeme.

On 07 Jul 2015, at 14:39, Karl Denninger <karl@denninger.net> wrote:

Incidentally while there are people who have questioned the 730 series power loss protection I've tested it with plug-pulls and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of them on an in-service machine here:

175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       643 (4 6868)



--
Karl Denninger
karl@denninger.net
The Market Ticker
[S/MIME encrypted email preferred]

Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
> Why would you think that you don't need RAID for ZFS?
>
> Reason I'm asking if because we are moving to ZFS on FreeBSD for our future projects.


Because you have zraid. :-)

https://blogs.oracle.com/bonwick/entry/raid_z

General points:

1. It's my understanding that ZFS is designed to talk to the hardware directly, and so it would be bad to hide the physical layer from ZFS unless you had to.
After all, I don't think they implemented a raid-like system inside ZFS just for the fun of it.

2. You have zraid built in and easy to manage within ZFS - and well tested compared to NewRaidController (TM) - why add another layer of management to your disk storage?

3. You reintroduce the raid write hole.

4. There might be some argument for hardware raid (existing system), but with software raid (the point I was addressing) it makes little sense at all.

5. If you're on hardware raid and your controller dies, you're screwed in several ways. It's harder to get a new raid controller than a new PC. Your chances of recovery are lower than with ZFS. IMHO more scary to recover from a failed raid controller, too.

6. Recovery is faster if the disks aren't full, e.g. ZFS resilvers only the data that is there. This might not seem a big deal but chances are it would save you 50% of your downtime in a crisis.

However, I think with Linux you might want to use RAID for the boot disk. I don't know if Linux can boot from ZFS yet. I would (and am) using FreeBSD with ZFS.

Graeme.


On 07 Jul 2015, at 18:56, Wei Shan <weishan.ang@gmail.com> wrote:

> Hi Graeme,
>
> Why would you think that you don't need RAID for ZFS?
>
> Reason I'm asking if because we are moving to ZFS on FreeBSD for our future projects.
>
> Regards,
> Wei Shan
>
> On 8 July 2015 at 00:46, Graeme B. Bell <graeme.bell@nibio.no> wrote:
> >
> > RAID controllers are completely unnecessary for SSD as they currently
> > exist.
>
> Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho.
>
> In my own situation, I had a tight budget, high performance demand and a newish machine with RAID controller and HDDs
init as a starting point. 
> So it was more a question of 'what can you do with a free raid controller and not much money' back in 2013. And it
has worked very well.
> Still, I had hoped for a bit more from the cheaper SSDs though, I'd hoped to use fastpath on the controller and
bypass the cache.
>
> The way NVMe prices are going though, I wouldn't do it again if I was doing it this year. I'd just go direct to nvme
and trash the raid controller. These sammy and intel nvmes are basically enterprise hardware at consumer prices. Heck,
I'll probably put one in my next gaming PC.
>
> Re: software raid.
>
> I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at
ZFS and thinking 'why the heck am I even using software raid?'
>
> G
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
>
>
> --
> Regards,
> Ang Wei Shan



Re: New server: SSD/RAID recommendations?

From
Michael Nolan
Date:


On Tue, Jul 7, 2015 at 10:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote:
The M500/M550/M600 are consumer class drives that don't have power
protection for all inflight data.* (like the Samsung 8x0 series and
the Intel 3x0 & 5x0 series).

The M500DC has full power protection for inflight data and is an
enterprise-class drive (like the Samsung 845DC or Intel S3500 & S3700
series).

So any drive without the capacitors to protect inflight data will
suffer from data loss if you're using disk write cache and you pull
the power.

Wow, I would be pretty angry if I installed an SSD in my desktop, and it loses a file that I saved just before pulling the power plug.

That can (and does) happen with spinning disks, too.
 

*Big addendum: There are two issues on powerloss that will mess with
Postgres. Data Loss and Data Corruption. The micron consumer drives
will have power loss protection against Data Corruption and the
enterprise drive will have power loss protection against BOTH.

https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf

 The Data Corruption problem is only an issue in non-SLC NAND but
it's industry wide. And even though some drives will protect against
that, the protection of inflight data that's been fsync'd is more
important and should disqualify *any* consumer drives from *any*
company from consideration for use with Postgres.

So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.

- Heikki


The sad fact is that MANY drives (ssd as well as spinning) lie about their fsync status.
--
Mike Nolan  

Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
The comment on HDDs is true and gave me another thought.

These new 'shingled' HDDs (the 8TB ones) rely on rewriting all the data on tracks that overlap your data, any time you
change the data. Result: disks 8-20x slower during writes, after they fill up.

Do they have power loss protection for the data being rewritten during reshingling? You could have data committed at
position X and you accidentally nuke data at position Y.

[I know that using a shingled disk sounds crazy (it sounds crazy to me) but you can bet there are people that just want
to max out the disk bays in their server... ]

Graeme.

On 07 Jul 2015, at 19:28, Michael Nolan <htfoot@gmail.com> wrote:

>
> [...]
>
> The sad fact is that MANY drives (ssd as well as spinning) lie about their fsync status.
> --
> Mike Nolan
>



Re: New server: SSD/RAID recommendations?

From
Scott Marlowe
Date:
On Tue, Jul 7, 2015 at 11:43 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
>
> The comment on HDDs is true and gave me another thought.
>
> These new 'shingled' HDDs (the 8TB ones) rely on rewriting all the data on tracks that overlap your data, any time
you change the data. Result: disks 8-20x slower during writes, after they fill up.
>
> Do they have power loss protection for the data being rewritten during reshingling? You could have data committed at
position X and you accidentally nuke data at position Y.
>
> [I know that using a shingled disk sounds crazy (it sounds crazy to me) but you can bet there are people that just
want to max out the disk bays in their server... ]

Let's just say no online backup companies are using those disks. :)
Biggest current production spinners being used I know of are 4TB,
non-shingled.


Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
On 07 Jul 2015, at 19:47, Scott Marlowe <scott.marlowe@gmail.com> wrote:

>> [I know that using a shingled disk sounds crazy (it sounds crazy to me) but you can bet there are people that just
wantto max out the disk bays in their server... ] 
>
> Let's just say no online backup companies are using those disks. :)

I'm not so sure. Literally the most famous online backup company is using them (or was planning to):
https://www.backblaze.com/blog/6-tb-hard-drive-face-off/
But I think that a massive read-only archive really is the only use for these things. I hope they go out of fashion
soon.

But I was thinking more of the 'small company postgres server' or 'charitable organisation postgres server'.
Someone is going to make this mistake, you can bet.
Probably not someone on THIS list, of course...

> Biggest current production spinners being used I know of are 4TB,
> non-shingled.

I think we may have some 6TB WD reds around here. I'll need to look around.

G



Re: New server: SSD/RAID recommendations?

From
"Wes Vaske (wvaske)"
Date:

Regarding:

“lie about their fsync status.”

 

This is mostly semantics but it might help google searches on the issue.

 

A drive doesn’t support fsync(), that’s a filesystem/kernel process. A drive will do a FLUSH CACHE. Before kernels 2.6.<low numbers> the fsync() call wouldn’t send any ATA or SCSI command to flush the disk cache. Whereas—AFAICT—modern kernels and file system versions *will* do this. When ‘sync’ is called the filesystem will issue the appropriate command to the disk to flush the write cache.

 

For ATA, this is “FLUSH CACHE” (E7h). To check support for the command use:

[root@postgres ~]# smartctl --identify /dev/sdu | grep "FLUSH CACHE"

  83     13          1   FLUSH CACHE EXT supported

  83     12          1   FLUSH CACHE supported

  86     13          1   FLUSH CACHE EXT supported

  86     12          1   FLUSH CACHE supported

 

The 1s in the 3rd column represent SUPPORTED for the feature listed in the last column.
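A related check is whether the drive's volatile write cache is enabled at all. A couple of hedged examples (the device name is a placeholder; hdparm applies to ATA drives only):

```shell
# Is the volatile write cache on? (/dev/sdX is a placeholder)
hdparm -W /dev/sdX            # reports write-caching on/off
smartctl -g wcache /dev/sdX   # same information via smartmontools

# On drives without power-loss protection, one option is to trade
# performance for safety by turning the write cache off entirely:
hdparm -W 0 /dev/sdX
```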

 

Cheers,

Wes Vaske

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Michael Nolan
Sent: Tuesday, July 07, 2015 12:28 PM
To: hlinnaka@iki.fi
Cc: Wes Vaske (wvaske); Graeme B. Bell; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

 

 

 

On Tue, Jul 7, 2015 at 10:59 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

[...]

The sad fact is that MANY drives (ssd as well as spinning) lie about their fsync status.
--
Mike Nolan

 

Re: New server: SSD/RAID recommendations?

From
Merlin Moncure
Date:
On Tue, Jul 7, 2015 at 11:46 AM, Graeme B. Bell <graeme.bell@nibio.no> wrote:
>>
>> RAID controllers are completely unnecessary for SSD as they currently
>> exist.
>
> [...]
>
> Re: software raid.
>
> I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at
ZFS and thinking 'why the heck am I even using software raid?'

Good point. At least for me, I've yet to jump on the ZFS bandwagon and
so don't have an opinion on it.

merlin


Re: New server: SSD/RAID recommendations?

From
Heikki Linnakangas
Date:
On 07/07/2015 09:01 PM, Wes Vaske (wvaske) wrote:
> Regarding:
> “lie about their fsync status.”
>
> [...]
>
> For ATA, this is “FLUSH CACHE” (E7h).

Right, to be precise, the problem isn't the drive lies about fsync(). It
lies about FLUSH CACHE instead. Search & replace fsync() with FLUSH
CACHE, and the same question remains: When the drive breaks its promise
wrt. FLUSH CACHE, does it nevertheless guarantee that the order the data
is eventually flushed to disk is consistent with the order in which the
data and FLUSH CACHE were sent to the drive? That's an important
distinction, because it makes the difference between "the most recent
data the application saved might be lost even though the FLUSH CACHE
command returned" and "your filesystem is corrupt".
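The ordering contract in question can be written down as a small sketch (assumes GNU coreutils 8.24+, where `sync` accepts a file argument and maps to fsync()):

```shell
dir=$(mktemp -d)
echo "state-A" > "$dir/A"
sync "$dir/A"    # barrier: A must be durable before anything written later
echo "state-B" > "$dir/B"
sync "$dir/B"    # a well-behaved drive guarantees: B persisted => A persisted
# A drive that reorders the two flushes can keep B but lose A after a power
# cut, which is the filesystem-corrupting case described above.
cat "$dir/A" "$dir/B"
```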

- Heikki



Re: New server: SSD/RAID recommendations?

From
"Graeme B. Bell"
Date:
Cache flushing isn't an atomic operation though. Even if the ordering is right, you are likely to have a partial fsync
on the disk when the lights go out - isn't your FS still corrupt?

On 07 Jul 2015, at 21:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> On 07/07/2015 09:01 PM, Wes Vaske (wvaske) wrote:
>
> [...]
>



Re: New server: SSD/RAID recommendations?

From
Heikki Linnakangas
Date:
On 07/07/2015 10:59 PM, Graeme B. Bell wrote:
> Cache flushing isn't an atomic operation though. Even if the ordering
> is right, you are likely to have a partial fsync on the disk when the
> lights go out - isn't your FS still corrupt?

If the filesystem is worth its salt, no. Journaling filesystems for
example rely on the journal to work around that problem, and there are
other mechanisms.

PostgreSQL has exactly the same problem and uses the WAL to solve it.
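The journaling idea can be sketched in the same shell vocabulary (again assuming GNU coreutils `sync FILE`): the journal entry is forced to stable storage before the data file is touched, so a crash at any step leaves either the old data or a replayable journal, never a torn write with no record of it:

```shell
dir=$(mktemp -d)
echo "new-value" > "$dir/journal"   # 1. record the intended change
sync "$dir/journal"                 # 2. journal reaches stable storage FIRST
cp "$dir/journal" "$dir/data"       # 3. apply the change to the data file
sync "$dir/data"                    # 4. persist the data itself
: > "$dir/journal"                  # 5. only now retire the journal entry
cat "$dir/data"                     # prints: new-value
```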

- Heikki