Thread: PostgreSQL block size for SSD RAID setup?
Hi,
I was reading a benchmark that sets out block sizes against raw IO performance for a number of different RAID configurations involving high end SSDs (the Mtron 7535) on a powerful RAID controller (the Areca 1680IX with 4GB RAM). See http://jdevelopment.nl/hardware/one-dvd-per-second/
From the figures given, a 16KB block size looks like the ideal size. Namely, in the graphs it's clear that a high number of IOPS (60,000) is maintained up to a 16KB block size, but drops sharply after that. MB/sec, however, still increases up to a block size of ~128KB. I would say the sweet spot is therefore either 16KB (if you emphasize a high number of IOPS) or something between 16KB and 128KB if you want to optimize for both a large number of IOPS and a large number of MB/sec. It seems to me that MB/sec is less important for most database operations, but since we're talking about random IO, MB/sec might still be an important figure.
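To make that trade-off concrete, here is a quick back-of-the-envelope sketch in Python. Only the 60,000 IOPS at 16KB figure comes from the discussion above; the other IOPS values are hypothetical placeholders you would have to read off the benchmark graphs yourself:

    # Throughput is simply IOPS * block size, so the "sweet spot" question
    # is where IOPS starts to fall faster than the block size grows.
    # Only the 16KB/60,000 figure is taken from the discussion above;
    # the rest are hypothetical placeholders.
    iops_by_block_size = {
        8 * 1024: 60000,     # placeholder (PostgreSQL's default block size)
        16 * 1024: 60000,    # figure quoted above
        32 * 1024: 30000,    # placeholder
        128 * 1024: 10000,   # placeholder
    }

    for block_size in sorted(iops_by_block_size):
        iops = iops_by_block_size[block_size]
        mb_per_sec = iops * block_size / (1024.0 * 1024.0)
        print("%4d KB blocks: %6d IOPS -> %7.1f MB/sec"
              % (block_size // 1024, iops, mb_per_sec))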
PostgreSQL, however, defaults to a block size of 8KB. From the observations made in the benchmark, this seems to be a less-than-optimal size (at least for such an SSD setup). The block size in PG only seems to be changeable by means of a recompile, so it's not something for a quick test.
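For completeness, you can at least check what block size an existing server was compiled with before going down the recompile route. A minimal sketch, assuming a local server and the psycopg2 driver (any client that can run SHOW will do):

    # Report the block size (BLCKSZ) the running PostgreSQL server was
    # built with. block_size is a read-only setting, so this is safe.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")     # adjust DSN as needed
    cur = conn.cursor()
    cur.execute("SHOW block_size;")
    print("server block size: %s bytes" % cur.fetchone()[0])
    cur.close()
    conn.close()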
Nevertheless, the numbers given in the benchmark intrigue me and I wonder if anyone has already tried setting PG's block size to 16KB for such a setup as used in the SSD benchmark.
Thanks in advance for all help,
Henk
henk de wit <henk53602@hotmail.com> writes:
> Hi,
> I was reading a benchmark that sets out block sizes against raw IO performance
> for a number of different RAID configurations involving high end SSDs (the
> Mtron 7535) on a powerful RAID controller (the Areca 1680IX with 4GB RAM). See
> http://jdevelopment.nl/hardware/one-dvd-per-second/

You might also be interested in:

http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/

http://thunk.org/tytso/blog/2009/02/22/should-filesystems-be-optimized-for-ssds/

It seems you have to do more work than just look at the application. You want the application, the filesystem, the partition layout, and the raid device geometry to all consistently maintain alignment with erase blocks.

--
Gregory Stark
EnterpriseDB    http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
>You might also be interested in:
>
> http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/
>
> http://thunk.org/tytso/blog/2009/02/22/should-filesystems-be-optimized-for-ssds/
Thanks a lot for the pointers. I'll definitely check these out.
> It seems you have to do more work than just look at the application. You want
> the application, the filesystem, the partition layout, and the raid device
> geometry to all consistently maintain alignment with erase blocks.
So it seems. PG is just a factor in this equation, but nevertheless an important one.
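On the alignment point, a rough sketch of how one might check a partition's starting offset on Linux (the sysfs path and 512-byte sector unit are Linux conventions; the 128KB erase-block size is only an assumed figure, check the drive's datasheet):

    # Check whether a partition starts on an (assumed) erase-block boundary.
    # /sys/block/<disk>/<partition>/start reports the offset in 512-byte
    # sectors on Linux; the erase-block size below is an assumption.
    ERASE_BLOCK_BYTES = 128 * 1024
    SECTOR_BYTES = 512

    def partition_offset_bytes(disk, part):
        with open("/sys/block/%s/%s/start" % (disk, part)) as f:
            return int(f.read()) * SECTOR_BYTES

    offset = partition_offset_bytes("sda", "sda1")   # hypothetical device
    if offset % ERASE_BLOCK_BYTES == 0:
        print("sda1 starts at byte %d: erase-block aligned" % offset)
    else:
        print("sda1 starts at byte %d: NOT erase-block aligned" % offset)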
> Hi,
> I was reading a benchmark that sets out block sizes against raw IO
> performance for a number of different RAID configurations involving high
> end SSDs (the Mtron 7535) on a powerful RAID controller (the Areca
> 1680IX with 4GB RAM). See
> http://jdevelopment.nl/hardware/one-dvd-per-second/

Lucky guys ;)

Something that bothers me about SSDs is the interface... The latest flash chips from Micron (32Gb = 4GB per chip) have something like 25 us "access time" (lol) and push data at 166 MB/s (yes megabytes per second) per chip. So two of these chips are enough to bottleneck a SATA 3Gbps link... there would be 8 of those chips in a 32GB SSD. Parallelizing would depend on the block size: putting all chips in parallel would increase the block size, so in practice I don't know how it's implemented, probably depends on the make and model of SSD.

And then RAIDing those (to get back the lost throughput from using SATA) will again increase the block size which is bad for random writes. So it's a bit of a chicken and egg problem. Also since harddisks have high throughput but slow seeks, all the OS'es and RAID cards, drivers, etc are probably optimized for throughput, not IOPS. You need a very different strategy for 100K/s 8kbyte IOs versus 1K/s 1MByte IOs. Like huge queues, smarter hardware, etc.

FusionIO got an interesting product by using the PCI-e interface which brings lots of benefits like much higher throughput and the possibility of using custom drivers optimized for handling much more IO requests per second than what the OS and RAID cards, and even SATA protocol, were designed for.

Intrigued by this I looked at the FusionIO benchmarks: more than 100.000 IOPS, really mindboggling, but in random access over a 10MB file. A little bit of google image search reveals the board contains a lot of Flash chips (expected) and a fat FPGA (expected), probably a high-end chip from X or A, and two DDR RAM chips from Samsung, probably acting as cache. So I wonder if the 10 MB file used as benchmark to reach those humongous IOPS was actually in the Flash?... or did they actually benchmark the device's onboard cache?...

It probably has writeback cache so on a random writes benchmark this is an interesting question. A good RAID card with BBU cache would have the same benchmarking gotcha (ie if you go crazy on random writes on a 10 MB file which is very small, and the device is smart, possibly at the end of the benchmark nothing at all was written to the disks!)

Anyway in a database use case if random writes are going to be a pain they are probably not going to be distributed in a tiny 10MB zone which the controller cache would handle...

(just rambling XDD)
Most benchmarks and reviews out there are very ignorant of SSD design. I suggest you start by reading some white papers and presentations from the research side that are public:
(pdf) http://research.microsoft.com/pubs/63596/USENIX-08-SSD.pdf
(html) http://www.usenix.org/events/usenix08/tech/full_papers/agrawal/agrawal_html/index.html
PDF presentation (PowerPoint style): http://institute.lanl.gov/hec-fsio/workshops/2008/presentations/day3/Prabhakaran-Panel-SSD.pdf
Benchmarks by EasyCo (a software layer that does what the hardware should if your SSD's controller stinks):
http://www.storagesearch.com/easyco-flashperformance-art.pdf
On 2/25/09 10:28 AM, "PFC" <lists@peufeu.com> wrote:
>> Hi,
>> I was reading a benchmark that sets out block sizes against raw IO
>> performance for a number of different RAID configurations involving high
>> end SSDs (the Mtron 7535) on a powerful RAID controller (the Areca
>> 1680IX with 4GB RAM). See
>> http://jdevelopment.nl/hardware/one-dvd-per-second/
>
> Lucky guys ;)
>
> Something that bothers me about SSDs is the interface... The latest flash
> chips from Micron (32Gb = 4GB per chip) have something like 25 us "access
> time" (lol) and push data at 166 MB/s (yes megabytes per second) per chip.
> So two of these chips are enough to bottleneck a SATA 3Gbps link... there
> would be 8 of those chips in a 32GB SSD.

No, you would need at least 10 to 12 of those chips for such an SSD (one that does good wear leveling), since over-provisioning is required for wear leveling and for keeping the write amplification factor down.

> Parallelizing would depend on the
> block size : putting all chips in parallel would increase the block size,
> so in practice I don't know how it's implemented, probably depends on the
> make and model of SSD.
With cheap low-end SSDs that don't deal with random writes properly and can't remap LBAs to physical blocks in small chunks, and with RAID stripes smaller than erase blocks, yes. But for SSDs you want large RAID block sizes, no RAID 5, and no pre-loading of the whole block on a small read. This is because random access inside one block is fast, unlike on hard drives.
> And then RAIDing those (to get back the lost throughput from using SATA)
> will again increase the block size which is bad for random writes. So it's
> a bit of a chicken and egg problem.
Yes. I get better performance with software RAID 10, multiple plain SAS adapters, and SSDs than with any RAID card I've tried, because the RAID card can't keep up with the I/Os and tries to do a lot of scheduling work. Furthermore, a battery-backed memory caching card is forced to prioritize writes at the expense of reads, which causes problems when you want to keep read latency low during a large batch write. Throw the same requests at a good SSD and it works (though 90% of them are bad at scheduling concurrent reads/writes at the moment).
> Also since harddisks have high
> throughput but slow seeks, all the OS'es and RAID cards, drivers, etc are
> probably optimized for throughput, not IOPS. You need a very different
> strategy for 100K/s 8kbyte IOs versus 1K/s 1MByte IOs. Like huge queues,
> smarter hardware, etc.
Intel's SSD, and AFAIK FusionIO's device, do not cache writes in RAM (a tiny bit is buffered in SRAM on the Intel controller, 256K = one erase block; unknown for FusionIO's FPGA).
> FusionIO got an interesting product by using the PCI-e interface which
> brings lots of benefits like much higher throughput and the possibility of
> using custom drivers optimized for handling much more IO requests per
> second than what the OS and RAID cards, and even SATA protocol, were
> designed for.
>
> Intrigued by this I looked at the FusionIO benchmarks : more than 100.000
> IOPS, really mindboggling, but in random access over a 10MB file. A little
> bit of google image search reveals the board contains a lot of Flash chips
> (expected) and a fat FPGA (expected) probably a high-end chip from X or A,
> and two DDR RAM chips from Samsung, probably acting as cache. So I wonder
> if the 10 MB file used as benchmark to reach those humongous IOPS was
> actually in the Flash ?... or did they actually benchmark the device's
> onboard cache ?...
That RAM is the working-space cache for the LBA -> physical block remapping. When a read request comes in, looking up which physical block contains the LBA would take a long time if it had to go through the flash (it's the block that claims to be mapped to that LBA with the highest transaction number, or some similar algorithm). So the lookup table is cached in RAM. The wear leveling and other tasks need working-set memory to operate as well.
The numbers are slower for a 10GB file, but not as dramatically so as you would expect. It's clearly not a writeback cache.
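To illustrate the idea, a purely toy sketch of the mapping scheme described above (not how any particular controller actually implements it):

    # Each flash block's metadata says which LBA it holds and carries a
    # transaction number; the newest copy wins. Rebuilding this by scanning
    # flash on every read would be slow, hence the RAM-resident map.
    flash_blocks = [
        # (physical_block, lba, transaction_number) -- made-up example data
        (0, 42, 7),
        (1, 42, 9),    # newer copy of LBA 42, supersedes physical block 0
        (2, 99, 3),
    ]

    lba_map = {}       # resolved LBA -> (physical block, txn), kept in RAM
    for phys, lba, txn in flash_blocks:
        if lba not in lba_map or txn > lba_map[lba][1]:
            lba_map[lba] = (phys, txn)

    def read(lba):
        phys, _ = lba_map[lba]     # O(1) RAM lookup instead of a flash scan
        return phys

    print("LBA 42 lives in physical block %d" % read(42))   # -> 1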
> It probably has writeback cache so on a random writes benchmark this is
> an interesting question. A good RAID card with BBU cache would have the
> same benchmarking gotcha (ie if you go crazy on random writes on a 10 MB
> file which is very small, and the device is smart, possibly at the end of
> the benchmark nothing at all was written to the disks !)
Certain write-load mixes can fragment the LBA -> physical block map and make wear leveling and write-amplification reduction expensive, slowing things down. This effect is usually temporary and highly workload dependent.
> Anyway in a database use case if random writes are going to be a pain
> they are probably not going to be distributed in a tiny 10MB zone which
> the controller cache would handle...
>
> (just rambling XDD)
The solution (see the white papers above) is more over-provisioning of flash. This can be achieved manually by making sure that more of the LBAs are never, ever written to: partition just 75% of the drive and leave the last 25% untouched, and there will be that much more spare area to work with, which makes even insanely heavy continuous random writes over the whole space perform at very high IOPS with low latency. This is only necessary for particular loads, and all flash devices over-provision to some extent. I'm pretty sure that the Intel X25-M, which provides 80GB to the user, has at least 100GB of actual flash in there, perhaps 120GB. That over-provisioning may be internal to the actual flash chip, since Intel makes both the chip and the controller. There is definitely extra ECC and block metadata in there (this is not new; again, see the whitepaper).
The X25-E certainly is over-provisioned.
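To put rough numbers on that (the 80GB user capacity is the X25-M figure mentioned above; the raw capacities are the same guesses, and the 75% partition is the manual trick just described):

    # Back-of-the-envelope spare-area arithmetic for over-provisioning.
    def spare_pct(raw_gb, exposed_gb):
        return 100.0 * (raw_gb - exposed_gb) / exposed_gb

    for raw_gb in (100, 120):             # speculative raw flash capacities
        print("raw %d GB, exposed 80 GB -> %.0f%% spare area"
              % (raw_gb, spare_pct(raw_gb, 80)))

    # Manual over-provisioning: partition only 75% of the advertised space.
    used_gb = 80 * 0.75
    print("partitioning %.0f of 80 GB adds another %.0f%% of spare area"
          % (used_gb, spare_pct(80, used_gb)))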
In the future, there are two things that will help flash a lot:
*File systems that avoid re-writing a region for as long as possible, preferring to write to areas previously freed at some point.
*New OS block device semantics. Currently it's just 'read' and 'write'; once every LBA has been written at least once, the device is always "100%" full. A 'deallocate' command would help SSD random writes, wear leveling, and write-amplification algorithms significantly.