Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10) - Mailing list pgsql-performance
From: Wes Vaske (wvaske)
Subject: Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10)
Date:
Msg-id: ea5b07eec3ac467985889e2e1cf4e301@bowex17a.micron.com
In response to: Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10) (Pietro Pugni <pietro.pugni@gmail.com>)
Responses: Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10)
List: pgsql-performance
> I used --numjobs=1 because I needed the time series values for bandwidth, latencies and IOPS. The command string was the same, except for varying IO depth and numjobs=1.
You might need to increase the number of jobs here. The primary reason for this parameter is to improve scaling when you’re CPU bound on a single thread. With numjobs=1, FIO will use only a single thread, and there’s only so much a single CPU core can do.
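For reference, a hedged sketch of how fio can still produce per-interval time series while scaling past one job; it simply extends the command string quoted further down with fio's logging options, and the device path, job count and log prefix are placeholders:

  fio --filename=/dev/sdx --direct=1 --rw=randrw --rwmixread=100 \
      --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio \
      --bs=4k --iodepth=16 --numjobs=4 --runtime=60 --group_reporting \
      --name=4ktest --write_bw_log=4ktest --write_lat_log=4ktest \
      --write_iops_log=4ktest --log_avg_msec=1000
  # writes one bandwidth, latency and IOPS log per job, averaged over 1-second windows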
> Since these are 6 devices bought from 4 different sellers, it’s impossible that they are all defective.
I was a little unclear on the disk cache part. It’s a setting, generally in the RAID controller / HBA. It can also be set at the OS level in Linux (hdparm) and in Windows (somewhere in Device Manager?). The reason to disable the disk cache is that it’s NOT protected against power loss on the MX300. By disabling it you can ensure 100% write consistency at the cost of write performance. (Using fully power-loss-protected drives lets you keep the disk cache enabled.)
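For example, on Linux the drive's volatile write cache can be checked and toggled per device with hdparm; a sketch, with /dev/sdx as a placeholder:

  hdparm -W /dev/sdx     # show the current write-caching state
  hdparm -W0 /dev/sdx    # disable the drive's volatile write cache
  hdparm -W1 /dev/sdx    # re-enable it (may need to be reapplied after a power cycle on some drives)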
> Why 64k and QD=4? I thought of 8k and a larger QD. Will test as soon as possible and report the results here :)
It’s more representative of what you’ll see at the application level. (If you’ve got a running system, you can just use iostat to see what your average QD is: run `iostat -x 10` and look at the avgqu-sz column. Change the 10-second interval to whatever works best for your environment.)
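A hedged fio sketch of that kind of profile; the device path and test name are placeholders, and numjobs can be raised if a single job turns out to be CPU bound:

  fio --filename=/dev/sdx --direct=1 --rw=randrw --rwmixread=70 \
      --ioengine=libaio --bs=64k --iodepth=4 --numjobs=1 \
      --runtime=60 --group_reporting --name=oltp-64k-7030
  # 64k blocks, 70% reads / 30% writes, queue depth 4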
> Do you have some HBA card to suggest? What do you think of the LSI SAS3008? I think it’s the same as the 3108 without the RAID-on-Chip feature. Probably I will buy a Lenovo HBA card with that chip. It seems blazing fast (1M IOPS) compared to the current embedded RAID controller (LSI 2008).
I’ve been able to consistently get the same performance out of any of the LSI-based cards. The 3008 and 3108 both work great, regardless of vendor. Just test or read up on the different configuration parameters (read ahead, write back vs write through, disk cache).
Wes Vaske
Senior Storage Solutions Engineer
Micron Technology
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Pietro Pugni
Sent: Tuesday, February 21, 2017 5:44 PM
To: Wes Vaske (wvaske) <wvaske@micron.com>
Cc: Merlin Moncure <mmoncure@gmail.com>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Suggestions for a HBA controller (6 x SSDs + mdadm RAID10)
Disclaimer: I’ve done extensive testing (FIO and postgres) with a few different RAID controllers and HW RAID vs mdadm. We (Micron) are Crucial, but I don’t personally work with the consumer drives.
Verify whether you have your disk write cache enabled or disabled. If it’s disabled, that will have a large impact on write performance.
What an honor :)
My SSDs are Crucial MX300 (consumer drives) but, as previously stated, they gave ~90k IOPS in all the benchmarks I found on the web, while mine top out at ~40k IOPS. Since these are 6 devices bought from 4 different sellers, it’s impossible that they are all defective.
Is this the *exact* string you used? `fio --filename=/dev/sdx --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=4ktest`
With FIO, you need to multiply iodepth by numjobs to get the final queue depth it’s pushing (in this case, 256). Make sure you’re looking at the correct data.
I used --numjobs=1 because I needed the time series values for bandwidth, latencies and IOPS. The command string was the same, except for varying IO depth and numjobs=1.
A few other things:
- Mdadm will give better performance than HW RAID for specific benchmarks.
- Performance is NOT linear with drive count for synthetic benchmarks.
- It is often nearly linear for application performance.
mdadm RAID10 scaled linearly while mdadm RAID0 scaled much less.
- HW RAID can give better performance if your drives do not have a capacitor-backed cache (like the MX300) AND the controller has a battery-backed cache. *Consumer drives can often get better performance from HW RAID*. (Otherwise mdadm has been faster in all of my testing.)
My RAID controller doesn’t have a BBU.
- mdadm RAID10 has a bug where reads are not properly distributed between the mirror pairs. (It uses the head position calculated from the last IO to determine which drive in a mirror pair should get the next read. This results in really weird behavior where most read IO goes to half of your drives instead of being evenly split, as it should be for SSDs.) You can see this by running iostat while you’ve got a load running; you’ll see an uneven distribution of IOs. FYI, the RAID1 implementation has an exception where it does NOT use head position for SSDs. I have yet to test this, but you should be able to get better performance by manually striping a RAID0 across multiple RAID1s instead of using the default RAID10 implementation.
Very interesting. I will double check this after buying and mounting the new HBA. I heard of someone doing what you are suggesting but never tried.
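For reference, a minimal sketch of the manual RAID1+0 layout described above, assuming six drives /dev/sd[a-f]; device names and chunk size are placeholders:

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb    # mirror pair 1
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd    # mirror pair 2
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf    # mirror pair 3
  mdadm --create /dev/md10 --level=0 --chunk=512 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3    # stripe across the three mirrors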
- Don’t focus on 4k Random Read. Do something more similar to a PG workload (64k 70/30 R/W @ QD=4 is *reasonably* close to what I see for heavy OLTP).
Why 64k and QD=4? I thought of 8k and a larger QD. Will test as soon as possible and report the results here :)
I’ve tested multiple controllers based on the LSI 3108 and found that default settings from one vendor to another provide drastically different performance profiles. Vendor A had much better benchmark performance (2x IOPS of B) while vendor B gave better application performance (20% better OLTP performance in Postgres). (I got equivalent performance from A & B when using the same settings).
Do you have some HBA card to suggest? What do you think of the LSI SAS3008? I think it’s the same as the 3108 without the RAID-on-Chip feature. Probably I will buy a Lenovo HBA card with that chip. It seems blazing fast (1M IOPS) compared to the current embedded RAID controller (LSI 2008).
I don’t know if I can connect a 12Gb/s HBA directly to my existing 6Gb/s expander/backplane. I’m sure I will have the right cables, but I don’t know if it will work without changing the expander/backplane.
Thank you very much for your time
Pietro Pugni