Thread: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

[PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Pietro Pugni
Date:
Hi there,
I configured an IBM X3650 M4 for development and testing purposes. It's composed of:
 - 2 x Intel Xeon E5-2690 @ 2.90Ghz (2 x 8 physical Cores + HT)
 - 96GB RAM DDR3 1333MHz (12 x 8GB)
 - 2 x 146GB SAS HDDs @ 15k rpm configured in RAID1 (mdadm)
 - 6 x 525GB SATA SSDs (over-provisioned at 25%, so 393GB available)

I've done a lot of testing focusing on 4k and 8k workloads and found that the IOPS of those SSDs are half of what's expected. On serverfault.com someone suggested that the bottleneck is probably the embedded RAID controller, an IBM ServeRAID M5110e, which is based on an LSI 2008 chip.

I’m using the disks in JBOD mode with mdadm software RAID, which is blazing fast. The CPU is also very fast, so I don’t mind having a little overhead due to software RAID.

My typical workload is Postgres run as a DWH with 1 to 2 billion rows, big indexes, partitions and so on, but also intensive statistical computations.


Here's a graph of those six SSDs benchmarked with fio as stand-alone disks (outside of the RAID):

https://i.stack.imgur.com/ZMhUJ.png

All those IOPS should be double if everything were working correctly. The curve trend is as expected for increasing IO depths.


Anyway, I would like to buy an HBA controller that gets the most out of those 6 SSDs. Each SSD should deliver about 80k to 90k IOPS, so in RAID10 I should get ~240k IOPS (6 x 80k / 2) and in RAID0 ~480k IOPS (6 x 80k). I've seen that mdadm scales performance effectively, but the controller caps the overall IOPS at ~120k (exactly half of the expected IOPS).

What HBA controller able to handle 500k IOPS would you suggest?


My server can hold 8 more SSDs, for a total of 14 SSDs and 1,260k theoretical IOPS. Even adding only 2 more disks would give 720k theoretical IOPS in RAID0.

What HBA controller able to handle more than 700k IOPS would you suggest?

Do you have any advice on using mdadm software RAID with SATA III SSDs and a plain HBA?

Thank you everyone
   Pietro Pugni



Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Merlin Moncure
Date:
On Tue, Feb 21, 2017 at 7:49 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:

Random points/suggestions:
*) mdadm is the way to go.   I think you'll get bandwidth constrained on most modern hba unless they are really crappy.  On reasonably modern hardware storage is rarely the bottleneck anymore (which is a great place to be).  Fancy raid controllers may actually hurt performance -- they are obsolete IMNSHO.

*) Small point, but you'll want to crank effective_io_concurrency (see: https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH%3DHWh1WbLRsioe%3DmzRJTHwtr%3D2azsTdQ%40mail.gmail.com). It only affects certain kinds of queries, but when it works it really works.    Those benchmarks were done on my crapbox dell workstation!

*) For very high transaction rates, you can get a lot of benefit from disabling synchronous_commit if you are willing to accommodate the risk.  I do not recommend disabling fsync unless you are prepared to regenerate the entire database at any time.

*) Don't assume indefinite linear scaling as you increase storage capacity -- the database itself can become the bottleneck, especially for writing.   To improve write performance, classic optimization strategies of trying to intelligently bundle writes around units of work still apply.  If you are expecting high rates of write activity your engineering focus needs to be here for sure (read scaling is comparatively pretty easy).

*) I would start doing your benchmarking with pgbench since that is going to most closely reflect measured production performance.  
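For example, a first pass could look something like this (the database name and scale factor are placeholders; adjust to your hardware). It also shows the synchronous_commit comparison mentioned above:

  createdb bench
  pgbench -i -s 1000 bench            # initialize pgbench tables at scale 1000 (roughly 15GB)
  pgbench -c 32 -j 8 -T 300 bench     # 32 clients, 8 worker threads, 5 minutes
  # optionally compare tps with synchronous_commit disabled
  psql -d bench -c "ALTER SYSTEM SET synchronous_commit = off;"
  psql -d bench -c "SELECT pg_reload_conf();"
  pgbench -c 32 -j 8 -T 300 bench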

> My typical workload is Postgres run as a DWH with 1 to 2 billion rows, big indexes, partitions and so on, but also intensive statistical computations.

If this is the case your stack performance is going to be based on data structure design.  Make liberal use of:
*) natural keys 
*) constraint exclusion for partition selection
*) BRIN indexes are amazing (if you can work within their limitations)
*) partial indexing
*) covering indexes.  Don't forget to vacuum your partitions before you make them live if you use them
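To make a couple of those concrete, here is a minimal sketch (the table, column and partition names are made up for illustration):

  psql -d dwh <<'SQL'
  -- BRIN index: tiny, works well when the column correlates with physical row order
  CREATE INDEX measurements_ts_brin ON measurements USING brin (ts);
  -- partial index: only index the rows the hot queries actually touch
  CREATE INDEX measurements_open_idx ON measurements (id) WHERE status = 'open';
  -- vacuum a freshly loaded partition before making it live
  VACUUM ANALYZE measurements_2017_02;
  SQL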

If your data is going to get really big and/or query activity is expected to be high, keep an eye on your scale-out strategy.  Going monolithic to bootstrap your app is the right choice IMO, but start thinking about the longer term if you are expecting growth.  I'm starting to come around to the perspective that lift/shift scale-out using postgres_fdw without an insane amount of app retooling could be a viable option by Postgres 11/12 or so.  For my part I scaled out over asynchronous dblink, which is a much more maintenance-heavy strategy (but works fabulously, although I wish you could connect asynchronously).

merlin

Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
"Wes Vaske (wvaske)"
Date:

Disclaimer: I've done extensive testing (fio and Postgres) with a few different RAID controllers and HW RAID vs mdadm. We (Micron) own the Crucial brand, but I don't personally work with the consumer drives.

 

Verify whether you have your disk write cache enabled or disabled. If it’s disabled, that will have a large impact on write performance.

 

Is this the *exact* string you used? `fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest`

 

With fio, you need to multiply iodepth by numjobs to get the final queue depth it's pushing (in this case, 256). Make sure you're looking at the correct data.
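To illustrate, these two runs put roughly the same total number of I/Os in flight (256), just spread over a different number of worker threads (a sketch only; the device path is a placeholder):

  fio --filename=/dev/sdx --direct=1 --rw=randread --ioengine=libaio --bs=4k --iodepth=256 --numjobs=1 --runtime=60 --group_reporting --name=qd256x1
  fio --filename=/dev/sdx --direct=1 --rw=randread --ioengine=libaio --bs=4k --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=qd16x16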

 

Few other things:

- mdadm will give better performance than HW RAID for specific benchmarks.

- Performance is NOT linear with drive count for synthetic benchmarks.

- It is often nearly linear for application performance.

- HW RAID can give better performance if your drives do not have a capacitor-backed cache (like the MX300) AND the controller has a battery-backed cache. *Consumer drives can often get better performance from HW RAID.* (Otherwise mdadm has been faster in all of my testing.)

- mdadm RAID10 has a bug where reads are not properly distributed between the mirror pairs. (It uses the head position calculated from the last IO to determine which drive in a mirror pair should get the next read. This results in really weird behavior, with most read IO going to half of your drives instead of being evenly split, as should be the case for SSDs.) You can see this by running iostat while you've got a load running: you'll see an uneven distribution of IOs. FYI, the RAID1 implementation has an exception where it does NOT use head position for SSDs. I have yet to test this, but you should be able to get better performance by manually striping a RAID0 across multiple RAID1s instead of using the default RAID10 implementation (see the sketch after this list).

- Don't focus on 4k random read. Do something more similar to a PG workload (64k 70/30 R/W @ QD=4 is *reasonably* close to what I see for heavy OLTP). I've tested multiple controllers based on the LSI 3108 and found that default settings from one vendor to another provide drastically different performance profiles. Vendor A had much better benchmark performance (2x the IOPS of B) while vendor B gave better application performance (20% better OLTP performance in Postgres). (I got equivalent performance from A & B when using the same settings.)
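A minimal sketch of that manual RAID1+0 layout with six drives (device names and chunk size are placeholders):

  # three RAID1 mirror pairs
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
  # stripe a RAID0 across the three mirrors
  mdadm --create /dev/md10 --level=0 --chunk=64 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3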

 

 

Wes Vaske

Senior Storage Solutions Engineer

Micron Technology

 


Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Merlin Moncure
Date:
On Tue, Feb 21, 2017 at 1:40 PM, Wes Vaske (wvaske) <wvaske@micron.com> wrote:

- HW RAID can give better performance if your drives do not have a capacitor-backed cache (like the MX300) AND the controller has a battery-backed cache. *Consumer drives can often get better performance from HW RAID.* (Otherwise mdadm has been faster in all of my testing.)


I stopped recommending non-capacitor drives a long time ago for databases.  A capacitor is basically a battery that operates on the drive itself and is not subject to chemical failure.  Also, drives without capacitors tend not (in my direct experience) to be suitable for database use in any scenario where write performance matters.  There are capacitor-equipped drives that give excellent performance for around $0.60/GB.  I'm curious what the entry point is for Micron models that are capacitor enabled...

MLC solid state drives are essentially RAID systems already, with very complex tradeoffs engineered into the controller itself -- HW RAID controllers are redundant, and their price and the latency they add to filesystem calls are not warranted.  I guess in theory an SSD-specialized RAID controller could cooperate with the drives and do things like manage wear leveling across multiple devices, but AFAIK no such product exists (note: I haven't looked lately).

merlin

Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Scott Marlowe
Date:
Suggestion #1 is to turn off any write caching on the RAID controller. Using LSI MegaRAID we went from 3k-5k tps to 18k just by turning off write caching. Basically it just got in the way.

Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
"Wes Vaske (wvaske)"
Date:

> I'm curious what the entry point is for Micron models that are capacitor enabled...

 

The 5100 is the entry SATA drive with full power loss protection.

http://www.anandtech.com/show/10886/micron-announces-5100-series-enterprise-sata-ssds-with-3d-tlc-nand

 

Fun Fact: 3D TLC can give better endurance than planar MLC. http://www.chipworks.com/about-chipworks/overview/blog/intelmicron-detail-their-3d-nand-iedm

 

My understanding (and I’m not a process or electrical engineer) is that the 3D cell size is significantly larger than what was being used for planar (Samsung’s 3D is reportedly a ~40nm class device vs our most recent planar which is 16nm). This results in many more electrons per cell which provides better endurance.

 

 

Wes Vaske

 


Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Pietro Pugni
Date:
Thanks a lot for your suggestions.

Random points/suggestions:
*) mdadm is the way to go.   I think you'll get bandwidth constrained on most modern hba unless they are really crappy.  On reasonably modern hardware storage is rarely the bottleneck anymore (which is a great place to be).  Fancy raid controllers may actually hurt performance -- they are obsolete IMNSHO.

*) Small point, but you'll want to crank effective_io_concurrency (see: https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH%3DHWh1WbLRsioe%3DmzRJTHwtr%3D2azsTdQ%40mail.gmail.com). It only affects certain kinds of queries, but when it works it really works.    Those benchmarks were done on my crapbox dell workstation!

This is really impressive, thank you for sharing it.

*) For very high transaction rates, you can get a lot of benefit from disabling synchronous_commit if you are willing to accommodate the risk.  I do not recommend disabling fsync unless you are prepared to regenerate the entire database at any time.

*) Don't assume indefinite linear scaling as you increase storage capacity -- the database itself can become the bottleneck, especially for writing.   To improve write performance, classic optimization strategies of trying to intelligently bundle writes around units of work still apply.  If you are expecting high rates of write activity your engineering focus needs to be here for sure (read scaling is comparatively pretty easy).

What do you mean by "units of work"?


*) I would start doing your benchmarking with pgbench since that is going to most closely reflect measured production performance.  

My final benchmark will be my application; it's quite complex and also does query parallelization using a custom splitter, so it will be hard to reproduce with pgbench. At the moment I was just figuring out why my SSDs weren't performing as expected in comparable benchmarks found on hardware review websites (fio with 4k and 8k workloads).


If this is the case your stack performance is going to be based on data structure design.  Make liberal use of:
*) natural keys 
*) constraint exclusion for partition selection
*) BRIN indexes are amazing (if you can work within their limitations)
*) partial indexing
*) covering indexes.  Don't forget to vacuum your partitions before you make them live if you use them

The data definition is already optimized for Postgres, but I didn't know about covering indexes. I read about BRIN but never tried it. Will do some testing.


If your data is going to get really big and/or query activity is expected to be high, keep an eye on your scale-out strategy.  Going monolithic to bootstrap your app is the right choice IMO, but start thinking about the longer term if you are expecting growth.  I'm starting to come around to the perspective that lift/shift scale-out using postgres_fdw without an insane amount of app retooling could be a viable option by Postgres 11/12 or so.  For my part I scaled out over asynchronous dblink, which is a much more maintenance-heavy strategy (but works fabulously, although I wish you could connect asynchronously).

Thank you for your hints 

 Pietro Pugni

Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Pietro Pugni
Date:
Disclaimer: I've done extensive testing (fio and Postgres) with a few different RAID controllers and HW RAID vs mdadm. We (Micron) own the Crucial brand, but I don't personally work with the consumer drives.
 
Verify whether you have your disk write cache enabled or disabled. If it’s disabled, that will have a large impact on write performance. 

What an honor :)
My SSDs are Crucial MX300s (consumer drives) but, as previously stated, they deliver ~90k IOPS in all the benchmarks I found on the web, while mine top out at ~40k IOPS. Since these are 6 devices bought from 4 different sellers, it's very unlikely that they are all defective.

Is this the *exact* string you used? `fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest`
 
With fio, you need to multiply iodepth by numjobs to get the final queue depth it's pushing (in this case, 256). Make sure you're looking at the correct data.

I used --numjobs=1 because I needed the time series values for bandwidth, latencies and IOPS. The command string was the same, except for varying the IO depth and numjobs=1.


Few other things:
- mdadm will give better performance than HW RAID for specific benchmarks.
- Performance is NOT linear with drive count for synthetic benchmarks.
- It is often nearly linear for application performance.

mdadm RAID10 scaled linearly while mdadm RAID0 scaled much less.


- HW RAID can give better performance if your drives do not have a capacitor-backed cache (like the MX300) AND the controller has a battery-backed cache. *Consumer drives can often get better performance from HW RAID.* (Otherwise mdadm has been faster in all of my testing.)

My RAID controller doesn’t have a BBU.


- mdadm RAID10 has a bug where reads are not properly distributed between the mirror pairs. (It uses the head position calculated from the last IO to determine which drive in a mirror pair should get the next read. This results in really weird behavior, with most read IO going to half of your drives instead of being evenly split, as should be the case for SSDs.) You can see this by running iostat while you've got a load running: you'll see an uneven distribution of IOs. FYI, the RAID1 implementation has an exception where it does NOT use head position for SSDs. I have yet to test this, but you should be able to get better performance by manually striping a RAID0 across multiple RAID1s instead of using the default RAID10 implementation.

Very interesting. I will double check this after buying and mounting the new HBA. I've heard of someone doing what you are suggesting but never tried it.


- Don't focus on 4k random read. Do something more similar to a PG workload (64k 70/30 R/W @ QD=4 is *reasonably* close to what I see for heavy OLTP).

Why 64k and QD=4? I thought of 8k and a larger QD. Will test as soon as possible and report the results here :)


I’ve tested multiple controllers based on the LSI 3108 and found that default settings from one vendor to another provide drastically different performance profiles. Vendor A had much better benchmark performance (2x IOPS of B) while vendor B gave better application performance (20% better OLTP performance in Postgres). (I got equivalent performance from A & B when using the same settings). 

Do you have some HBA card to suggest? What do you think of the LSI SAS3008? I think it's the same as the 3108 without the RAID-on-Chip feature. Probably I will buy a Lenovo HBA card with that chip. It seems blazing fast (1M IOPS) compared to the current embedded RAID controller (LSI 2008).
I don't know if I can connect a 12Gb/s HBA directly to my existing 6Gb/s expander/backplane. I will surely have the right cables, but I don't know if it will work without changing the expander/backplane.


Thank you a lot for your time
 Pietro Pugni



Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Pietro Pugni
Date:
> Suggestion #1 is to turn off any write caching on the RAID controller. Using LSI MegaRAID we went from 3k-5k tps to 18k just by turning off write caching. Basically it just got in the way.

Write caching is disabled because I removed the RAID controller's expansion card. It didn't let the server boot properly with the SSDs mounted.

Thank you for the suggestion
 Pietro Pugni

Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
"Wes Vaske (wvaske)"
Date:

> I used --numjobs=1 because I needed the time series values for bandwidth, latencies and IOPS. The command string was the same, except for varying the IO depth and numjobs=1.

 

You might need to increase the number of jobs here. The primary reason for this parameter is to improve scaling when you're single-thread CPU bound. With numjobs=1, fio will use only a single thread and there's only so much a single CPU core can do.

 

> Since these are 6 devices bought from 4 different sellers, it's very unlikely that they are all defective.

 

I was a little unclear on the disk cache part. It's a setting, generally in the RAID controller / HBA. It's also an OS-level option in Linux (hdparm) and Windows (somewhere in Device Manager?). The reason to disable the disk cache is that it's NOT protected against power loss on the MX300. So by disabling it you can ensure 100% write consistency at the cost of write performance. (Using fully power-protected drives will let you keep the disk cache enabled.)
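On Linux the drive-level setting can be checked and changed with hdparm, something like this (the device name is a placeholder):

  hdparm -W /dev/sdx      # show the current write-cache state
  hdparm -W 0 /dev/sdx    # disable the volatile write cache
  hdparm -W 1 /dev/sdx    # re-enable it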

 

> Why 64k and QD=4? I thought of 8k and a larger QD. Will test as soon as possible and report the results here :)

 

It's more representative of what you'll see at the application level. (If you've got a running system, you can just use iostat to see what your average QD is: `iostat -x 10`, and it's the column avgqu-sz. Change from 10 seconds to whatever interval works best for your environment.)
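A run along those lines might look like this (a sketch only; the device path is a placeholder and the other options mirror the fio string quoted earlier; keep in mind iodepth x numjobs is the total queue depth):

  fio --filename=/dev/sdx --direct=1 --rw=randrw --rwmixread=70 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=64k --iodepth=4 --numjobs=1 --runtime=60 --group_reporting --name=64k-7030-qd4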

 

> Do you have some HBA card to suggest? What do you think of the LSI SAS3008? I think it's the same as the 3108 without the RAID-on-Chip feature. Probably I will buy a Lenovo HBA card with that chip. It seems blazing fast (1M IOPS) compared to the current embedded RAID controller (LSI 2008).

 

I’ve been able to consistently get the same performance out of any of the LSI based cards. The 3008 and 3108 both work great, regardless of vendor. Just test or read up on the different configuration parameters (read ahead, write back vs write through, disk cache)

 

 

Wes Vaske

Senior Storage Solutions Engineer

Micron Technology

 


Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Pietro Pugni
Date:
I just mounted and configured my brand new LSI 3008-8i. This server had one SAS expander connected to two backplanes (8 disks on the first backplane and no disks on the second). After some testing I found the SAS expander was a bottleneck, so I removed it and connected the first backplane directly to the controller.

The following results are from 4k 100% random reads (32QD) run in parallel on each single SSD:

Raw SSDs [ 4k, 100% random reads, 32 Queue Depth]
ServeRaid m5110e (with SAS expander) [numjob=1]
  read : io=5111.2MB, bw=87227KB/s, iops=21806, runt= 60002msec
  read : io=4800.6MB, bw=81927KB/s, iops=20481, runt= 60002msec
  read : io=4997.6MB, bw=85288KB/s, iops=21322, runt= 60002msec
  read : io=4796.2MB, bw=81853KB/s, iops=20463, runt= 60001msec
  read : io=5062.6MB, bw=86400KB/s, iops=21599, runt= 60001msec
  read : io=4989.6MB, bw=85154KB/s, iops=21288, runt= 60001msec
Total read iops: 126,595 ( ~ 21,160 iops/disk)


Raw SSDs [ 4k, 100% random reads, 32 Queue Depth]
Lenovo N2215 (LSI 3008-8i flashed with LSI IT firmware, without SAS expander) [numjob=1]
  read : io=15032MB, bw=256544KB/s, iops=64136, runt= 60001msec
  read : io=16679MB, bw=284656KB/s, iops=71163, runt= 60001msec
  read : io=15046MB, bw=256779KB/s, iops=64194, runt= 60001msec
  read : io=16667MB, bw=284444KB/s, iops=71111, runt= 60001msec
  read : io=16692MB, bw=284867KB/s, iops=71216, runt= 60001msec
  read : io=15149MB, bw=258534KB/s, iops=64633, runt= 60002msec
Total read iops: 406,453 ( ~ 67,742 iops/disk)


That's about 3.2x the IOPS of the embedded controller (a 221% improvement).
I chose 4k at QD=32 because it should deliver the maximum IOPS and should clearly show whether the I/O is properly configured.
I didn't bother testing the embedded M5110e without the SAS expander because it would be slower for sure.
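For reference, each per-disk run was adapted from the fio string quoted earlier in the thread, roughly like this (exact options may have differed slightly; the device name varies per disk):

  fio --filename=/dev/sdx --direct=1 --rw=randrw --rwmixread=100 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --group_reporting --name=4k-randread-qd32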


You might need to increase the number of jobs here. The primary reason for this parameter is to improve scaling when you're single-thread CPU bound. With numjobs=1, fio will use only a single thread and there's only so much a single CPU core can do.

The HBA provided slightly better performance even without removing the expander, and was slightly faster still after removing it; but then I tried increasing numjobs from 1 to 16 (I also tried 12, 18, 20, 24 and 32 but found 16 to give the highest IOPS) and the benchmarks returned the expected results. I wonder how this relates to Postgres... probably effective_io_concurrency, as suggested by Merlin Moncure, is the counterpart of numjobs in fio?


I was a little unclear on the disk cache part. It's a setting, generally in the RAID controller / HBA. It's also an OS-level option in Linux (hdparm) and Windows (somewhere in Device Manager?). The reason to disable the disk cache is that it's NOT protected against power loss on the MX300. So by disabling it you can ensure 100% write consistency at the cost of write performance. (Using fully power-protected drives will let you keep the disk cache enabled.)

I always enabled the write cache during my tests. I tried disabling it but performance was too poor. Those SSDs are consumer ones and don't have any capacitor :(


> Why 64k and QD=4? I thought of 8k and a larger QD. Will test as soon as possible and report the results here :)
 
It's more representative of what you'll see at the application level. (If you've got a running system, you can just use iostat to see what your average QD is: `iostat -x 10`, and it's the column avgqu-sz. Change from 10 seconds to whatever interval works best for your environment.)

I tried your suggestion (64k, 70/30 random R/W, QD=4) on RAID0 and RAID10 (mdadm) with the new controller, and the results are quite good considering that the underlying SSDs are consumer drives with the original firmware (over-provisioned at 25%).

RAID10 is about 22% slower than RAID0 in both reads and writes, at least on a 1-minute run. The totals and averages were calculated from the whole fio log output using the per-job IOPS.

These are the results:


############################################################################
mdadm RAID0 [ 64k, 70% random reads, 30% random writes, 04 Queue Depth]
Lenovo N2215 (LSI 3008-8i flashed with LSI IT firmware, without SAS expander) [numjob=16]
############################################################################
Run status group 0 (all jobs):
   READ: io=75943MB, aggrb=1265.7MB/s, minb=80445KB/s, maxb=81576KB/s, mint=60001msec, maxt=60004msec
  WRITE: io=32585MB, aggrb=556072KB/s, minb=34220KB/s, maxb=35098KB/s, mint=60001msec, maxt=60004msec

Disk stats (read/write):
    md127: ios=1213256/520566, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=202541/86892, aggrmerge=0/0, aggrticks=490418/137398, aggrin_queue=628566, aggrutil=99.20%
  sdf: ios=202557/86818, merge=0/0, ticks=450384/131512, in_queue=582528, util=98.58%
  sdb: ios=202626/87184, merge=0/0, ticks=573448/177336, in_queue=751784, util=99.20%
  sdg: ios=202391/86810, merge=0/0, ticks=463644/137084, in_queue=601272, util=98.46%
  sde: ios=202462/86551, merge=0/0, ticks=470028/121424, in_queue=592500, util=98.79%
  sda: ios=202287/86697, merge=0/0, ticks=473312/121192, in_queue=595044, util=98.95%
  sdh: ios=202928/87293, merge=0/0, ticks=511696/135840, in_queue=648272, util=99.14%

Total read iops: 20,242 ( ~ 3,374 iops/disk)
Total write iops: 8,679 ( ~ 1,447 iops/disk)



############################################################################
mdadm RAID10 [ 64k, 70% random reads, 30% random writes, 04 Queue Depth]
Lenovo N2215 (LSI 3008-8i flashed with LSI IT firmware, without SAS expander) [numjob=16]
############################################################################
Run status group 0 (all jobs):
   READ: io=58624MB, aggrb=976.11MB/s, minb=62125KB/s, maxb=62814KB/s, mint=60001msec, maxt=60005msec
  WRITE: io=25190MB, aggrb=429874KB/s, minb=26446KB/s, maxb=27075KB/s, mint=60001msec, maxt=60005msec

Disk stats (read/write):
    md127: ios=936349/402381, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=156357/134348, aggrmerge=0/0, aggrticks=433286/262226, aggrin_queue=696052, aggrutil=99.41%
  sdf: ios=150239/134315, merge=0/0, ticks=298268/168472, in_queue=466852, util=95.31%
  sdb: ios=153088/133664, merge=0/0, ticks=329160/188060, in_queue=517432, util=96.81%
  sdg: ios=157361/135065, merge=0/0, ticks=658208/459168, in_queue=1118588, util=99.16%
  sde: ios=161361/134315, merge=0/0, ticks=476388/278628, in_queue=756056, util=97.61%
  sda: ios=160431/133664, merge=0/0, ticks=548620/329708, in_queue=878708, util=99.41%
  sdh: ios=155667/135065, merge=0/0, ticks=289072/149324, in_queue=438680, util=96.71%

Total read iops: 15,625 ( ~ 2,604 iops/disk)
Total write iops: 6,709 ( ~ 1,118 iops/disk)



> Do you have some HBA card to suggest? What do you think of the LSI SAS3008? I think it's the same as the 3108 without the RAID-on-Chip feature. Probably I will buy a Lenovo HBA card with that chip. It seems blazing fast (1M IOPS) compared to the current embedded RAID controller (LSI 2008).
 
I’ve been able to consistently get the same performance out of any of the LSI based cards. The 3008 and 3108 both work great, regardless of vendor. Just test or read up on the different configuration parameters (read ahead, write back vs write through, disk cache)

Do you have any suggestions for fine-tuning this controller? I'm referring to parameters like nr_requests, queue_depth, etc.
Also, is there any way to optimize the various mdadm parameters available under /sys/block/mdX/? I disabled the internal bitmap and write performance improved.
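For reference, these are the kinds of knobs I mean (a sketch only; device and array names are placeholders and the values are just examples):

  # per-device block layer queue depth
  cat /sys/block/sda/queue/nr_requests
  echo 1023 > /sys/block/sda/queue/nr_requests
  # I/O scheduler (noop/none is commonly suggested for SSDs)
  cat /sys/block/sda/queue/scheduler
  echo noop > /sys/block/sda/queue/scheduler
  # drop the md write-intent bitmap (what I did; costs a full resync after an unclean shutdown)
  mdadm --grow --bitmap=none /dev/md127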



Thank you
 Pietro Pugni



Re: [PERFORM] Suggestions for an HBA controller (6 x SSDs + mdadm RAID10)

From
Merlin Moncure
Date:
On Thu, Mar 2, 2017 at 3:51 PM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
> The HBA provided slightly better performance even without removing the
> expander, and was slightly faster still after removing it; but then I tried
> increasing numjobs from 1 to 16 (I also tried 12, 18, 20, 24 and 32 but found
> 16 to give the highest IOPS) and the benchmarks returned the expected results.
> I wonder how this relates to Postgres... probably effective_io_concurrency, as
> suggested by Merlin Moncure, is the counterpart of numjobs in fio?

Kind of.  effective_io_concurrency allows the database to send more than one
filesystem command to the hardware from a single process.  Sadly, only certain
classes of query can currently leverage this facility -- but as you can see,
it's a huge optimization.
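For example, setting it cluster-wide might look like this (the value is just a starting point for SSDs; tune it for your hardware):

  psql -c "ALTER SYSTEM SET effective_io_concurrency = 256;"
  psql -c "SELECT pg_reload_conf();"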

merlin