Thread: Testing Sandforce SSD
Hello list,

Probably like many others I've wondered why no SSD manufacturer puts a small BBU on an SSD drive. Triggered by Greg Smith's mail http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php here, and also anandtech's review at http://www.anandtech.com/show/2899/1 (see page 6 for pictures of the capacitor), I ordered a SandForce drive and this week it finally arrived.

And now I have to test it and was wondering about some things like

* How to test for power failure? I thought of running, on the same machine, a parallel pgbench setup on two clusters where one runs with data and wal on a rotating disk, the other on the SSD, both without BBU controller. Then turn off power. Do that a few times. The problem in this scenario is that even when the SSD would show no data loss and the rotating disk would a few times, a dozen tests without failure isn't actually proof that the drive can write its complete buffer to disk after power failure.

* How long should the power be turned off? A minute? 15 minutes?

* What filesystem to use on the SSD? To minimize writes and maximize chance for seeing errors I'd choose ext2 here. For the sake of not comparing apples with pears I'd have to go with ext2 on the rotating data disk as well.

Do you guys have any more ideas to properly 'feel this disk at its teeth'?

regards,
Yeb Havinga
> Do you guys have any more ideas to properly 'feel this disk at its teeth'?

While an 'end-to-end' test using PG is fine, I think it would be easier to determine if the drive is behaving correctly by using a simple test program that emulates the storage semantics the WAL expects. Have it write a constant stream of records, fsync'ing after each write. Record the highest record number flushed so far in some place that won't be lost with the drive under test (e.g. send it over the network to another machine).

Kill the power, bring the system back up again and examine what's at the tail end of that file. I think this will give you the worst case test with the easiest result discrimination.

If you want to you could add concurrent random writes to another file for extra realism.

Someone here may already have a suitable test program. I know I've written several over the years in order to test I/O performance, prove the existence of kernel bugs, and so on.

I doubt it matters much how long the power is turned off. A second should be plenty of time to flush pending writes if the drive is going to do so.
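A minimal sketch in Python of the kind of tester described above; the witness host, file name and record size here are placeholder assumptions, not something from the thread:

    # Append fixed-size records, fsync after each one, and report the highest
    # record number known to be durable to a separate "witness" machine.
    import os, socket, struct

    WITNESS = ("witness.example.com", 9999)   # assumed witness host/port
    RECORD = 512                              # assumed record size in bytes

    def run(path="testfile.dat"):
        sock = socket.create_connection(WITNESS)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        seq = 0
        while True:
            seq += 1
            header = struct.pack("<Q", seq)
            os.write(fd, header + b"x" * (RECORD - len(header)))
            os.fsync(fd)                            # claim durability only after fsync
            sock.sendall(("%d\n" % seq).encode())   # witness logs the last durable record

    if __name__ == "__main__":
        run()

After the power cut, the last number the witness received must still be readable at the tail of the file; if it isn't, the drive lied about the flush.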
On Sat, 24 Jul 2010, David Boreham wrote:

>> Do you guys have any more ideas to properly 'feel this disk at its teeth'?
>
> While an 'end-to-end' test using PG is fine, I think it would be easier to
> determine if the drive is behaving correctly by using a simple test program
> that emulates the storage semantics the WAL expects. Have it write a constant
> stream of records, fsync'ing after each write. Record the highest record
> number flushed so far in some place that won't be lost with the drive under
> test (e.g. send it over the network to another machine).
>
> Kill the power, bring the system back up again and examine what's at the tail
> end of that file. I think this will give you the worst case test with the
> easiest result discrimination.
>
> If you want to you could add concurrent random writes to another file for
> extra realism.
>
> Someone here may already have a suitable test program. I know I've written
> several over the years in order to test I/O performance, prove the existence
> of kernel bugs, and so on.
>
> I doubt it matters much how long the power is turned off. A second should be
> plenty of time to flush pending writes if the drive is going to do so.

remember that SATA is designed to be hot-plugged, so you don't have to kill the entire system to kill power to the drive.

this is a little more abrupt than the system losing power, but in terms of losing data this is about the worst case (while at the same time, it eliminates the possibility that the OS continues to perform writes to the drive as power dies, which is a completely different class of problems, independent of the drive type)

David Lang
On Jul 24, 2010, at 12:20 AM, Yeb Havinga wrote:

> The problem in this scenario is that even when the SSD would show no data
> loss and the rotating disk would a few times, a dozen tests without failure
> isn't actually proof that the drive can write its complete buffer to disk
> after power failure.

Yes, this is always going to be the case with testing like this - you'll never be able to prove that it will always be safe.
Yeb Havinga wrote:
> Probably like many others I've wondered why no SSD manufacturer puts
> a small BBU on an SSD drive. Triggered by Greg Smith's mail
> http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php
> here, and also anandtech's review at
> http://www.anandtech.com/show/2899/1 (see page 6 for pictures of the
> capacitor) I ordered a SandForce drive and this week it finally arrived.

Note that not all of the Sandforce drives include a capacitor; I hope you got one that does! I wasn't aware any of the SF drives with a capacitor on them were even shipping yet, all of the ones I'd seen were the chipset that doesn't include one still. Haven't checked in a few weeks though.

> * How to test for power failure?

I've had good results using one of the early programs used to investigate this class of problems: http://brad.livejournal.com/2116715.html?page=2

You really need a second "witness" server to do this sort of thing reliably, which that provides.

> * What filesystem to use on the SSD? To minimize writes and maximize
> chance for seeing errors I'd choose ext2 here.

I don't consider there to be any reason to deploy any part of a PostgreSQL database on ext2. The potential for downtime if the fsck doesn't happen automatically far outweighs the minimal performance advantage you'll actually see in real applications. All of the benchmarks showing large gains for ext2 over ext3 I have seen have been synthetic, not real database performance; the internal ones I've run using things like pgbench do not show a significant improvement. (Yes, I'm already working on finding time to publicly release those findings)

Put it on ext3, toggle on noatime, and move on to testing. The overhead of the metadata writes is the least of the problems when doing write-heavy stuff on Linux.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Sat, Jul 24, 2010 at 3:20 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Hello list,
>
> Probably like many others I've wondered why no SSD manufacturer puts a
> small BBU on an SSD drive. Triggered by Greg Smith's mail
> http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php here,
> and also anandtech's review at http://www.anandtech.com/show/2899/1 (see
> page 6 for pictures of the capacitor) I ordered a SandForce drive and this
> week it finally arrived.
>
> And now I have to test it and was wondering about some things like
>
> * How to test for power failure?

I test like this: write a small program that sends an endless series of inserts like this:

*) on the server: create table foo (id serial);
*) from the client: insert into foo default values returning id;

on the client side print the inserted value to the terminal after the query is reported as complete to the client.

Run the program, wait a bit, then pull the plug on the server. The database should recover clean and the last reported insert on the client should be there when it restarts. Try restarting immediately a few times; then, if that works, try it again and let it simmer overnight. If it makes it at least 24-48 hours that's a very promising sign.

merlin
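A minimal sketch of the client side of this test in Python with psycopg2; the connection string is just a placeholder, and the table is created beforehand on the server as described above:

    import psycopg2

    conn = psycopg2.connect("host=testserver dbname=test")  # assumed DSN
    cur = conn.cursor()
    while True:
        cur.execute("insert into foo default values returning id")
        row = cur.fetchone()
        conn.commit()          # wait until the server acknowledges the commit
        print(row[0])          # only ids the server claims are durable get printed

After the server comes back up, the highest id printed on the client must exist in the table; if it doesn't, a supposedly committed transaction was lost.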
Greg Smith wrote:
> Note that not all of the Sandforce drives include a capacitor; I hope
> you got one that does! I wasn't aware any of the SF drives with a
> capacitor on them were even shipping yet, all of the ones I'd seen
> were the chipset that doesn't include one still. Haven't checked in a
> few weeks though.

I think I did, it was expensive enough, though while ordering it's very easy to order the wrong one: all names on the product category page look the same. (OCZ Vertex 2 Pro)

>> * How to test for power failure?
>
> I've had good results using one of the early programs used to
> investigate this class of problems:
> http://brad.livejournal.com/2116715.html?page=2

A great tool, thanks for the link!

diskchecker: running 34 sec, 4.10% coverage of 500 MB (1342 writes; 39/s)
diskchecker: running 35 sec, 4.24% coverage of 500 MB (1390 writes; 39/s)
diskchecker: running 36 sec, 4.35% coverage of 500 MB (1427 writes; 39/s)
diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s)
didn't get 'ok' from server (11387 316950), msg=[] = Connection reset by peer at ./diskchecker.pl line 132.

here's where I removed the power and left it off for about a minute. Then started again and did the verify:

yeb@a:~$ ./diskchecker.pl -s client45.eemnes verify test_file
 verifying: 0.00%
Total errors: 0

:-)

this was on ext2

>> * What filesystem to use on the SSD? To minimize writes and maximize
>> chance for seeing errors I'd choose ext2 here.
>
> I don't consider there to be any reason to deploy any part of a
> PostgreSQL database on ext2. The potential for downtime if the fsck
> doesn't happen automatically far outweighs the minimal performance
> advantage you'll actually see in real applications.

Hmm.. wouldn't that apply for other filesystems as well? I know that JFS also won't mount if booted unclean, it somehow needs a marker from the fsck. Don't know for ext3, xfs etc.

> All of the benchmarks showing large gains for ext2 over ext3 I have
> seen have been synthetic, not real database performance; the internal ones
> I've run using things like pgbench do not show a significant
> improvement. (Yes, I'm already working on finding time to publicly
> release those findings)

The reason I'd choose ext2 on the SSD was mainly to decrease the number of writes, not for performance. Maybe I should ultimately do tests for both journalled and ext2 filesystems and compare the amount of data per x pgbench transactions.

> Put it on ext3, toggle on noatime, and move on to testing. The
> overhead of the metadata writes is the least of the problems when
> doing write-heavy stuff on Linux.

Will surely do and post the results.

thanks,
Yeb Havinga
Yeb Havinga wrote:
> diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s)
> Total errors: 0
>
> :-)

OTOH, I now notice the 39 writes/s .. if that means ~ 39 tps... bummer.
Greg Smith wrote:
> Note that not all of the Sandforce drives include a capacitor; I hope
> you got one that does! I wasn't aware any of the SF drives with a
> capacitor on them were even shipping yet, all of the ones I'd seen
> were the chipset that doesn't include one still. Haven't checked in a
> few weeks though.

Answer my own question here: the drive Yeb got was the brand spanking new OCZ Vertex 2 Pro, selling for $649 at Newegg for example: http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and with the supercapacitor listed right in the main product specifications there. This is officially the first inexpensive (relatively) SSD with a battery-backed write cache built into it. If Yeb's test results prove it works as it's supposed to under PostgreSQL, I'll be happy to finally have a moderately priced SSD I can recommend to people for database use. And I fear I'll be out of excuses to avoid buying one as a toy for my home system.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Sat, 2010-07-24 at 16:21 -0400, Greg Smith wrote:
> Greg Smith wrote:
>> Note that not all of the Sandforce drives include a capacitor; I hope
>> you got one that does! I wasn't aware any of the SF drives with a
>> capacitor on them were even shipping yet, all of the ones I'd seen
>> were the chipset that doesn't include one still. Haven't checked in a
>> few weeks though.
>
> Answer my own question here: the drive Yeb got was the brand spanking
> new OCZ Vertex 2 Pro, selling for $649 at Newegg for example:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and with
> the supercapacitor listed right in the main product specifications
> there. This is officially the first inexpensive (relatively) SSD with a
> battery-backed write cache built into it. If Yeb's test results prove
> it works as it's supposed to under PostgreSQL, I'll be happy to finally
> have a moderately priced SSD I can recommend to people for database
> use. And I fear I'll be out of excuses to avoid buying one as a toy for
> my home system.

That is quite the toy. I can get 4 SATA-II with RAID Controller, with battery backed cache, for the same price or less :P

Sincerely,
Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Joshua D. Drake wrote:
> That is quite the toy. I can get 4 SATA-II with RAID Controller, with
> battery backed cache, for the same price or less :P

True, but if you look at tests like http://www.anandtech.com/show/2899/12 it suggests there's probably at least a 6:1 performance speedup for workloads with a lot of random I/O to them. And I'm really getting sick of the power/noise/heat that the 6 drives in my home server produces.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
Yeb Havinga wrote:
> Yeb Havinga wrote:
>> diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s)
>> Total errors: 0
>>
>> :-)
>
> OTOH, I now notice the 39 writes/s .. if that means ~ 39 tps... bummer.

When playing with it a bit more, I couldn't get the test_file to be created in the right place on the test system. It turns out I had the diskchecker config switched, and 39 writes/s was the speed of the not-rebooted system, sorry. I did several diskchecker.pl tests this time with the testfile on the SSD; none of the tests have returned an error :-)

Writes/s start low but quickly converge to a number in the range of 1200 to 1800. The writes diskchecker does are 16kB writes. Making this 4kB writes does not increase writes/s. 32kB seems a little less, 64kB is about two thirds of the initial writes/s and 128kB is half. So no BBU speeds here for writes, but still ~ factor 10 improvement in iops over a rotating SATA disk.

regards,
Yeb Havinga

PS: hdparm showed write cache was on. I did tests with both ext2 and xfs, where the xfs tests I did with both barrier and nobarrier.
Yeb Havinga wrote:
> Writes/s start low but quickly converge to a number in the range of
> 1200 to 1800. The writes diskchecker does are 16kB writes. Making this
> 4kB writes does not increase writes/s. 32kB seems a little less, 64kB
> is about two thirds of the initial writes/s and 128kB is half.

Let's turn that into MB/s numbers:

4k * 1200 = 4.7 MB/s
8k * 1200 = 9.4 MB/s
16k * 1200 = 18.75 MB/s
64k * 1200 * 2/3 [800] = 50 MB/s
128k * 1200 / 2 [600] = 75 MB/s

For comparison's sake, a 7200 RPM drive running PostgreSQL will do <120 commits/second without a BBWC, so at an 8K block size that's <1 MB/s. If you put a cache in the middle, I'm used to seeing about 5000 8K commits/second, which is around 40 MB/s. So this is sitting right in the middle of those two. Sequential writes with a commit after each one like this are basically the worst case for the SSD, so if it can provide reasonable performance on that I'd be happy.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
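The same arithmetic as a quick sanity check in Python, with the block sizes and write rates taken from the figures above:

    # commits/s * block size (KB) / 1024 = MB/s
    for kb, writes in [(4, 1200), (8, 1200), (16, 1200), (64, 800), (128, 600)]:
        print("%3dkB * %4d/s = %5.2f MB/s" % (kb, writes, kb * writes / 1024.0))

which prints roughly 4.69, 9.38, 18.75, 50.00 and 75.00 MB/s respectively.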
Greg Smith wrote:
> Put it on ext3, toggle on noatime, and move on to testing. The
> overhead of the metadata writes is the least of the problems when
> doing write-heavy stuff on Linux.

I ran a pgbench run and a power failure test during pgbench with a 3 year old computer:

8GB DDR ?
Intel Core 2 Duo 6600 @ 2.40GHz
Intel Corporation 82801IB (ICH9) 2 port SATA IDE Controller
64 bit 2.6.31-22-server (Ubuntu karmic), kernel option elevator=deadline

sysctl options besides increasing shm:
fs.file-max=327679
fs.aio-max-nr=3145728
vm.swappiness=0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Filesystem on SSD with postgresql data: ext3 mounted with noatime,nodiratime,relatime

Postgresql cluster: did initdb with C locale. Data and pg_xlog together on the same ext3 filesystem.

Changed in postgresql.conf: settings with pgtune for OLTP and 15 connections
maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 31 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 15 # pgtune wizard 2010-07-25

Initialized with scale 800 which is about 12GB. I deliberately went beyond the in-RAM size for this machine (that would be ~5GB), so random reads would weigh in the result. Then let pgbench run the tcp benchmark with -M prepared, 10 clients and -T 3600 (one hour); after that I loaded the logfile into a db and did some queries. Then realized the pgbench result page was not in the screen buffer anymore so I cannot copy it here, but hey, those can be edited as well right ;-)

select count(*),count(*)/3600,avg(time),stddev(time) from log;
  count  | ?column? |          avg          |     stddev
---------+----------+-----------------------+----------------
 4939212 |     1372 | 7282.8581978258880161 | 11253.96967962
(1 row)

Judging from the latencies in the logfiles I did not experience serious lagging (time is in microseconds):

select * from log order by time desc limit 3;
 client_id | tx_no |  time   | file_no |   epoch    | time_us
-----------+-------+---------+---------+------------+---------
         3 | 33100 | 1229503 |       0 | 1280060345 |  866650
         9 | 39990 | 1077519 |       0 | 1280060345 |  858702
         2 | 55323 | 1071060 |       0 | 1280060519 |  750861
(3 rows)

select * from log order by time desc limit 3 OFFSET 1000;
 client_id | tx_no  |  time  | file_no |   epoch    | time_us
-----------+--------+--------+---------+------------+---------
         5 | 262466 | 245953 |       0 | 1280062074 |  513789
         1 | 267519 | 245867 |       0 | 1280062074 |  513301
         7 | 273662 | 245532 |       0 | 1280062078 |  378932
(3 rows)

select * from log order by time desc limit 3 OFFSET 10000;
 client_id | tx_no  | time  | file_no |   epoch    | time_us
-----------+--------+-------+---------+------------+---------
         5 | 123011 | 82854 |       0 | 1280061036 |  743986
         6 | 348967 | 82853 |       0 | 1280062687 |  776317
         8 | 439789 | 82848 |       0 | 1280063109 |  552928
(3 rows)

Then I started pgbench again with the same settings, let it run for a few minutes and in another console did CHECKPOINT and then turned off power. After restarting, the database recovered without a problem.
LOG: database system was interrupted; last known up at 2010-07-25 10:14:15 EDT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at F/98008610
LOG: record with zero length at F/A2BAC040
LOG: redo done at F/A2BAC010
LOG: last completed transaction was at log time 2010-07-25 10:14:16.151037-04

regards,
Yeb Havinga
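The same latency figures can also be pulled straight from the pgbench -l transaction log without loading it into a table; a small Python sketch, assuming the 8.4 per-line format of client_id tx_no latency_us file_no epoch time_us and a placeholder log file name:

    lat = [int(line.split()[2]) for line in open("pgbench_log.12345")]
    mean = sum(lat) / float(len(lat))
    var = sum((x - mean) ** 2 for x in lat) / len(lat)
    print("%d transactions, avg %.0f us, stddev %.0f us" % (len(lat), mean, var ** 0.5))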
Yeb Havinga wrote:
> 8GB DDR2 something.. (lots of details removed)

Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at http://tinypic.com/r/x5e846/3

Thanks http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm for the gnuplot and psql scripts!
Yeb Havinga wrote:
> Greg Smith wrote:
>> Put it on ext3, toggle on noatime, and move on to testing. The
>> overhead of the metadata writes is the least of the problems when
>> doing write-heavy stuff on Linux.
>
> I ran a pgbench run and power failure test during pgbench with a 3
> year old computer

On the same config more tests.

scale 10 read only and read/write tests. note: only 240 s.

starting vacuum...end.
transaction type: SELECT only
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 8208115
tps = 34197.109896 (including connections establishing)
tps = 34200.658720 (excluding connections establishing)

yeb@client45:~$ pgbench -c 10 -l -M prepared -T 240 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 809271
tps = 3371.147020 (including connections establishing)
tps = 3371.518611 (excluding connections establishing)

----------
scale 300 (just fits in RAM) read only and read/write tests

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 9219279
tps = 30726.931095 (including connections establishing)
tps = 30729.692823 (excluding connections establishing)

The test above doesn't really test the drive but shows the CPU/RAM limit.

pgbench -c 10 -l -M prepared -T 3600 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 8838200
tps = 2454.994217 (including connections establishing)
tps = 2455.012480 (excluding connections establishing)

------
scale 2000

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 755772
tps = 2518.547576 (including connections establishing)
tps = 2518.762476 (excluding connections establishing)

So the test above tests the random seek performance. Iostat on the drive showed a steady just over 4000 read io's/s:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.39    0.00   13.37   60.40    0.00   14.85

Device: rrqm/s wrqm/s     r/s  w/s    rkB/s  wkB/s avgrq-sz avgqu-sz await svctm  %util
sda       0.00   0.00 4171.00 0.00 60624.00   0.00    29.07    11.81  2.83  0.24 100.00
sdb       0.00   0.00    0.00 0.00     0.00   0.00     0.00     0.00  0.00  0.00   0.00

pgbench -c 10 -l -M prepared -T 24000 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 24000 s
number of transactions actually processed: 30815691
tps = 1283.979098 (including connections establishing)
tps = 1283.980446 (excluding connections establishing)

Note the duration of several hours. No long waits occurred - of this last test the latency png is at http://yfrog.com/f/0vlatencywp/ and the TPS graph at http://yfrog.com/f/b5tpsp/

regards,
Yeb Havinga
On Sun, 25 Jul 2010, Yeb Havinga wrote: > Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at > http://tinypic.com/r/x5e846/3 Does your latency graph really have milliseconds as the y axis? If so, this device is really slow - some requests have a latency of more than a second! Matthew -- The early bird gets the worm, but the second mouse gets the cheese.
Matthew Wakeling wrote:
> On Sun, 25 Jul 2010, Yeb Havinga wrote:
>> Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at
>> http://tinypic.com/r/x5e846/3
>
> Does your latency graph really have milliseconds as the y axis?

Yes

> If so, this device is really slow - some requests have a latency of
> more than a second!

I try to just give the facts. Please remember that particular graphs are from a read/write pgbench run on a bigger than RAM database that ran for some time (so with checkpoints), on a *single* $435 50GB drive without BBU raid controller. Also, this is a picture with a few million points: the ones above 200ms are perhaps a hundred and hence make up a very small fraction.

So far I'm pretty impressed with this drive. Let's be fair to OCZ and the SandForce guys and not shoot from the hip things like "really slow", without that being backed by a graphed pgbench run together with its cost, so we can compare numbers with numbers.

regards,
Yeb Havinga
Matthew Wakeling wrote: > Does your latency graph really have milliseconds as the y axis? If so, > this device is really slow - some requests have a latency of more than > a second! Have you tried that yourself? If you generate one of those with standard hard drives and a BBWC under Linux, I expect you'll discover those latencies to be >5 seconds long. I recently saw >100 *seconds* running a large pgbench test due to latency flushing things to disk, on a system with 72GB of RAM. Takes a long time to flush >3GB of random I/O out to disk when the kernel will happily cache that many writes until checkpoint time. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Yeb Havinga wrote: > Please remember that particular graphs are from a read/write pgbench > run on a bigger than RAM database that ran for some time (so with > checkpoints), on a *single* $435 50GB drive without BBU raid controller. To get similar *average* performance results you'd need to put about 4 drives and a BBU into a server. The worst-case latency on that solution is pretty bad though, when a lot of random writes are queued up; I suspect that's where the SSD will look much better. By the way: if you want to run a lot more tests in an organized fashion, that's what http://github.com/gregs1104/pgbench-tools was written to do. That will spit out graphs by client and by scale showing how sensitive the test results are to each. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, 26 Jul 2010, Greg Smith wrote: > Matthew Wakeling wrote: >> Does your latency graph really have milliseconds as the y axis? If so, this >> device is really slow - some requests have a latency of more than a second! > > Have you tried that yourself? If you generate one of those with standard > hard drives and a BBWC under Linux, I expect you'll discover those latencies > to be >5 seconds long. I recently saw >100 *seconds* running a large pgbench > test due to latency flushing things to disk, on a system with 72GB of RAM. > Takes a long time to flush >3GB of random I/O out to disk when the kernel > will happily cache that many writes until checkpoint time. Apologies, I was interpreting the graph as the latency of the device, not all the layers in-between as well. There isn't any indication in the email with the graph as to what the test conditions or software are. Obviously if you factor in checkpoints and the OS writing out everything, then you would have to expect some large latency operations. However, if the device itself behaved as in the graph, I would be most unhappy and send it back. Yeb also made the point - there are far too many points on that graph to really tell what the average latency is. It'd be instructive to have a few figures, like "only x% of requests took longer than y". Matthew -- I wouldn't be so paranoid if you weren't all out to get me!!
Matthew Wakeling wrote: > Apologies, I was interpreting the graph as the latency of the device, > not all the layers in-between as well. There isn't any indication in > the email with the graph as to what the test conditions or software are. That info was in the email preceding the graph mail, but I see now I forgot to mention it was a 8.4.4 postgres version. regards, Yeb Havinga
On Mon, Jul 26, 2010 at 10:26 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Matthew Wakeling wrote:
>> Apologies, I was interpreting the graph as the latency of the device,
>> not all the layers in-between as well. There isn't any indication in
>> the email with the graph as to what the test conditions or software are.
>
> That info was in the email preceding the graph mail, but I see now I
> forgot to mention it was a 8.4.4 postgres version.
Speaking of the layers in-between, has this test been done with the ext3 journal on a different device? Maybe the purpose is wrong for the SSD. Use the SSD for the ext3 journal and the spindled drives for filesystem? Another possibility is to use ext2 on the SSD.
Greg
Matthew Wakeling wrote: > Yeb also made the point - there are far too many points on that graph > to really tell what the average latency is. It'd be instructive to > have a few figures, like "only x% of requests took longer than y". Average latency is the inverse of TPS. So if the result is, say, 1200 TPS, that means the average latency is 1 / (1200 transactions/second) = 0.83 milliseconds/transaction. The average TPS figure is normally on a more useful scale as far as being able to compare them in ways that make sense to people. pgbench-tools derives average, worst-case, and 90th percentile figures for latency from the logs. I have 37MB worth of graphs from a system showing how all this typically works for regular hard drives I've been given permission to publish; just need to find a place to host it at internally and I'll make the whole stack available to the world. So far Yeb's data is showing that a single SSD is competitive with a small array on average, but with better worst-case behavior than I'm used to seeing. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Spiegelberg wrote: > Speaking of the layers in-between, has this test been done with the > ext3 journal on a different device? Maybe the purpose is wrong for > the SSD. Use the SSD for the ext3 journal and the spindled drives for > filesystem? The main disk bottleneck on PostgreSQL databases are the random seeks for reading and writing to the main data blocks. The journal information is practically noise in comparison--it barely matters because it's so much less difficult to keep up with. This is why I don't really find ext2 interesting either. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Smith <greg@2ndquadrant.com> wrote: > Yeb's data is showing that a single SSD is competitive with a > small array on average, but with better worst-case behavior than > I'm used to seeing. So, how long before someone benchmarks a small array of SSDs? :-) -Kevin
Greg Smith wrote:
> Yeb Havinga wrote:
>> Please remember that particular graphs are from a read/write pgbench
>> run on a bigger than RAM database that ran for some time (so with
>> checkpoints), on a *single* $435 50GB drive without BBU raid controller.
>
> To get similar *average* performance results you'd need to put about 4
> drives and a BBU into a server. The worst-case latency on that
> solution is pretty bad though, when a lot of random writes are queued
> up; I suspect that's where the SSD will look much better.
>
> By the way: if you want to run a lot more tests in an organized
> fashion, that's what http://github.com/gregs1104/pgbench-tools was
> written to do. That will spit out graphs by client and by scale
> showing how sensitive the test results are to each.

Got it, running the default config right now. When you say 'comparable to a small array' - could you give a ballpark figure for 'small'?

regards,
Yeb Havinga

PS: Some update on the testing: I did some ext3, ext4, xfs, jfs and also ext2 tests on the just-in-memory read/write test (scale 300). No real winners or losers, though ext2 isn't really faster and the manual need for fix (y) during boot makes it impractical in its standard configuration. I did some poweroff tests with barriers explicitly off in ext3, ext4 and xfs; still all recoveries went ok.
Yeb Havinga wrote:
>> To get similar *average* performance results you'd need to put about
>> 4 drives and a BBU into a server. The

Please forget this question, I now see it in the mail I'm replying to. Sorry for the spam!

-- Yeb
Yeb Havinga wrote: > I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory > read/write test. (scale 300) No real winners or losers, though ext2 > isn't really faster and the manual need for fix (y) during boot makes > it impractical in its standard configuration. That's what happens every time I try it too. The theoretical benefits of ext2 for hosting PostgreSQL just don't translate into significant performance increases on database oriented tests, certainly not ones that would justify the downside of having fsck issues come back again. Glad to see that holds true on this hardware too. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, Jul 26, 2010 at 12:40 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Greg Spiegelberg wrote: >> >> Speaking of the layers in-between, has this test been done with the ext3 >> journal on a different device? Maybe the purpose is wrong for the SSD. Use >> the SSD for the ext3 journal and the spindled drives for filesystem? > > The main disk bottleneck on PostgreSQL databases are the random seeks for > reading and writing to the main data blocks. The journal information is > practically noise in comparison--it barely matters because it's so much less > difficult to keep up with. This is why I don't really find ext2 interesting > either. Note that SSDs aren't usually real fast at large sequential writes though, so it might be worth putting pg_xlog on a spinning pair in a mirror and seeing how much, if any, the SSD drive speeds up when not having to do pg_xlog.
On Mon, Jul 26, 2010 at 1:45 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Yeb Havinga wrote:
>> I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory
>> read/write test. (scale 300) No real winners or losers, though ext2
>> isn't really faster and the manual need for fix (y) during boot makes
>> it impractical in its standard configuration.
>
> That's what happens every time I try it too. The theoretical benefits of
> ext2 for hosting PostgreSQL just don't translate into significant
> performance increases on database oriented tests, certainly not ones that
> would justify the downside of having fsck issues come back again. Glad to
> see that holds true on this hardware too.
I know I'm talking development now but is there a case for a pg_xlog block device to remove the file system overhead and guaranteeing your data is written sequentially every time?
Greg
On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
> On Mon, Jul 26, 2010 at 1:45 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Yeb Havinga wrote:
>>> I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory
>>> read/write test. (scale 300) No real winners or losers, though ext2 isn't
>>> really faster and the manual need for fix (y) during boot makes it
>>> impractical in its standard configuration.
>>
>> That's what happens every time I try it too. The theoretical benefits of
>> ext2 for hosting PostgreSQL just don't translate into significant
>> performance increases on database oriented tests, certainly not ones that
>> would justify the downside of having fsck issues come back again. Glad to
>> see that holds true on this hardware too.
>
> I know I'm talking development now but is there a case for a pg_xlog block
> device to remove the file system overhead and guaranteeing your data is
> written sequentially every time?

For one, I doubt that it's a relevant enough efficiency loss in comparison with a significantly more complex implementation (for one you can't grow/shrink, for another you have to do more complex, hw-dependent things like rounding to hardware boundaries, page size etc to stay efficient). For another, my experience is that at a relatively low point XLogInsert gets to be the bottleneck - so I don't see much point in improving at that low level (yet at least).

Where I would like to do some hw-dependent measuring (because I see significant improvements there) would be prefetching for seqscan, indexscans et al. using blktrace... But I currently don't have the time. And it's another topic ;-)

Andres
Greg Spiegelberg wrote: > I know I'm talking development now but is there a case for a pg_xlog > block device to remove the file system overhead and guaranteeing your > data is written sequentially every time? It's possible to set the PostgreSQL wal_sync_method parameter in the database to open_datasync or open_sync, and if you have an operating system that supports direct writes it will use those and bypass things like the OS write cache. That's close to what you're suggesting, supposedly portable, and it does show some significant benefit when it's properly supported. Problem has been, the synchronous writing code on Linux in particular hasn't ever worked right against ext3, and the PostgreSQL code doesn't make the right call at all on Solaris. So there's two popular platforms that it just plain doesn't work on, even though it should. We've gotten reports that there are bleeding edge Linux kernel and library versions available now that finally fix that issue, and that PostgreSQL automatically takes advantage of them when it's compiled on one of them. But I'm not aware of any distribution that makes this easy to try out that's available yet, paint is still wet on the code I think. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, 2010-07-26 at 14:34 -0400, Greg Smith wrote:
> Matthew Wakeling wrote:
>> Yeb also made the point - there are far too many points on that graph
>> to really tell what the average latency is. It'd be instructive to
>> have a few figures, like "only x% of requests took longer than y".
>
> Average latency is the inverse of TPS. So if the result is, say, 1200
> TPS, that means the average latency is 1 / (1200 transactions/second) =
> 0.83 milliseconds/transaction.

This is probably only true if you run all transactions sequentially in one connection? If you run 10 parallel threads and get 1200 tps, the average transaction time (latency?) is probably closer to 8.3 ms?

> The average TPS figure is normally on a
> more useful scale as far as being able to compare them in ways that make
> sense to people.
>
> pgbench-tools derives average, worst-case, and 90th percentile figures
> for latency from the logs. I have 37MB worth of graphs from a system
> showing how all this typically works for regular hard drives I've been
> given permission to publish; just need to find a place to host it at
> internally and I'll make the whole stack available to the world. So far
> Yeb's data is showing that a single SSD is competitive with a small
> array on average, but with better worst-case behavior than I'm used to
> seeing.

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
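In other words, with N concurrent clients the average per-transaction latency is roughly N / TPS rather than 1 / TPS; a quick check with the figures used above:

    clients, tps = 10, 1200
    print("avg latency ~ %.1f ms" % (1000.0 * clients / tps))   # ~8.3 ms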
On Mon, Jul 26, 2010 at 01:47:14PM -0600, Scott Marlowe wrote:
> Note that SSDs aren't usually real fast at large sequential writes
> though, so it might be worth putting pg_xlog on a spinning pair in a
> mirror and seeing how much, if any, the SSD drive speeds up when not
> having to do pg_xlog.

xlog is also where I use ext2; it does bench faster for me in that config, and the fsck issues don't really exist because you're not in a situation with a lot of files being created/removed.

Mike Stone
On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
> I know I'm talking development now but is there a case for a pg_xlog block
> device to remove the file system overhead and guaranteeing your data is
> written sequentially every time?

If you dedicate a partition to xlog, you already get that in practice with no extra development.

Mike Stone
Michael Stone wrote:
> On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
>> I know I'm talking development now but is there a case for a pg_xlog
>> block device to remove the file system overhead and guaranteeing your
>> data is written sequentially every time?
>
> If you dedicate a partition to xlog, you already get that in practice
> with no extra development.

Due to the LBA remapping of the SSD, I'm not sure if putting files that are sequentially written in a different partition (together with e.g. tables) would make a difference: in the end the SSD will have a set of new blocks in its buffer and somehow arrange them into sets of 128KB or 256KB writes for the flash chips. See also http://www.anandtech.com/show/2899/2

But I ran out of ideas to test, so I'm going to test it anyway.

regards,
Yeb Havinga
Yeb Havinga wrote:
> Michael Stone wrote:
>> On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
>>> I know I'm talking development now but is there a case for a pg_xlog
>>> block device to remove the file system overhead and guaranteeing your
>>> data is written sequentially every time?
>>
>> If you dedicate a partition to xlog, you already get that in practice
>> with no extra development.
>
> Due to the LBA remapping of the SSD, I'm not sure if putting files
> that are sequentially written in a different partition (together with
> e.g. tables) would make a difference: in the end the SSD will have a
> set of new blocks in its buffer and somehow arrange them into sets of
> 128KB or 256KB writes for the flash chips. See also
> http://www.anandtech.com/show/2899/2
>
> But I ran out of ideas to test, so I'm going to test it anyway.

Same machine config as mentioned before, with data and xlog on separate partitions, ext3 with barrier off (safe on this SSD).

pgbench -c 10 -M prepared -T 3600 -l test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 10856359
tps = 3015.560252 (including connections establishing)
tps = 3015.575739 (excluding connections establishing)

This is about 25% faster than data and xlog combined on the same filesystem.

Below is output from iostat -xk 1 -p /dev/sda, which shows per-partition statistics each second. sda2 is data, sda3 is xlog. In the third second a checkpoint seems to start.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          63.50    0.00   30.50    2.50    0.00    3.50

Device: rrqm/s  wrqm/s   r/s    w/s      rkB/s   wkB/s    avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    6518.00  36.00  2211.00  148.00  35524.00 31.75    0.28     0.12   0.11   25.00
sda1    0.00    2.00     0.00   5.00     0.00    636.00   254.40   0.03     6.00   2.00   1.00
sda2    0.00    218.00   36.00  40.00    148.00  1032.00  31.05    0.00     0.00   0.00   0.00
sda3    0.00    6298.00  0.00   2166.00  0.00    33856.00 31.26    0.25     0.12   0.12   25.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          60.50    0.00   37.50    0.50    0.00    1.50

Device: rrqm/s  wrqm/s   r/s    w/s      rkB/s   wkB/s    avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    6514.00  33.00  2283.00  140.00  35188.00 30.51    0.32     0.14   0.13   29.00
sda1    0.00    0.00     0.00   3.00     0.00    12.00    8.00     0.00     0.00   0.00   0.00
sda2    0.00    0.00     33.00  2.00     140.00  8.00     8.46     0.03     0.86   0.29   1.00
sda3    0.00    6514.00  0.00   2278.00  0.00    35168.00 30.88    0.29     0.13   0.13   29.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          33.00    0.00   34.00   18.00    0.00   15.00

Device: rrqm/s  wrqm/s   r/s    w/s      rkB/s   wkB/s    avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    3782.00  7.00   7235.00  28.00   44068.00 12.18    69.52    9.46   0.09   62.00
sda1    0.00    0.00     0.00   1.00     0.00    4.00     8.00     0.00     0.00   0.00   0.00
sda2    0.00    322.00   7.00   6018.00  28.00   25360.00 8.43     69.22    11.33  0.08   47.00
sda3    0.00    3460.00  0.00   1222.00  0.00    18728.00 30.65    0.30     0.25   0.25   30.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.00    0.00   36.00   22.50    0.00   32.50

Device: rrqm/s  wrqm/s   r/s    w/s      rkB/s   wkB/s    avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    1079.00  3.00   11110.00 12.00   49060.00 8.83     120.64   10.95  0.08   86.00
sda1    0.00    2.00     0.00   2.00     0.00    320.00   320.00   0.12     60.00  35.00  7.00
sda2    0.00    30.00    3.00   10739.00 12.00   43076.00 8.02     120.49   11.30  0.08   83.00
sda3    0.00    1047.00  0.00   363.00   0.00    5640.00  31.07    0.03     0.08   0.08   3.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          62.00    0.00   31.00    2.00    0.00    5.00

Device: rrqm/s  wrqm/s   r/s    w/s      rkB/s   wkB/s    avgrq-sz avgqu-sz await  svctm  %util
sda     0.00    6267.00  51.00  2493.00  208.00  35040.00 27.71    1.80     0.71   0.12   31.00
sda1    0.00    0.00     0.00   3.00     0.00    12.00    8.00     0.00     0.00   0.00   0.00
sda2    0.00    123.00   51.00  344.00   208.00  1868.00  10.51    1.50     3.80   0.10   4.00
sda3    0.00    6144.00  0.00   2146.00  0.00    33160.00 30.90    0.30     0.14   0.14   30.00
On Wed, Jul 28, 2010 at 9:18 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
Yeb Havinga wrote:
>> Due to the LBA remapping of the SSD, I'm not sure if putting files that
>> are sequentially written in a different partition (together with e.g.
>> tables) would make a difference: in the end the SSD will have a set of
>> new blocks in its buffer and somehow arrange them into sets of 128KB or
>> 256KB writes for the flash chips. See also
>> http://www.anandtech.com/show/2899/2
>>
>> But I ran out of ideas to test, so I'm going to test it anyway.
>
> Same machine config as mentioned before, with data and xlog on separate
> partitions, ext3 with barrier off (safe on this SSD).
>
> pgbench -c 10 -M prepared -T 3600 -l test
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 10
> duration: 3600 s
> number of transactions actually processed: 10856359
> tps = 3015.560252 (including connections establishing)
> tps = 3015.575739 (excluding connections establishing)
>
> This is about 25% faster than data and xlog combined on the same filesystem.

The trick may be in kjournald for which there is 1 for each ext3 journalled file system. I learned back in Red Hat 4 pre U4 kernels there was a problem with kjournald that would either cause 30 second hangs or lock up my server completely when pg_xlog and data were on the same file system plus a few other "right" things going on.

Given the multicore world we have today, I think it makes sense that multiple ext3 file systems, and the kjournald's that service them, is faster than a single combined file system.

Greg
On Wed, Jul 28, 2010 at 03:45:23PM +0200, Yeb Havinga wrote:
> Due to the LBA remapping of the SSD, I'm not sure if putting files
> that are sequentially written in a different partition (together with
> e.g. tables) would make a difference: in the end the SSD will have a
> set of new blocks in its buffer and somehow arrange them into sets of
> 128KB or 256KB writes for the flash chips. See also
> http://www.anandtech.com/show/2899/2

It's not a question of the hardware side, it's the software. The xlog needs to be synchronized, and the things the filesystem has to do to make that happen penalize the non-xlog disk activity. That's why my preferred config is xlog on ext2, rest on xfs. That allows the synchronous activity to happen with minimal overhead, while the parts that benefit from having more data in flight can do that freely.

Mike Stone
Greg Smith wrote:
> Greg Smith wrote:
>> Note that not all of the Sandforce drives include a capacitor; I hope
>> you got one that does! I wasn't aware any of the SF drives with a
>> capacitor on them were even shipping yet, all of the ones I'd seen
>> were the chipset that doesn't include one still. Haven't checked in
>> a few weeks though.
>
> Answer my own question here: the drive Yeb got was the brand spanking
> new OCZ Vertex 2 Pro, selling for $649 at Newegg for example:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and
> with the supercapacitor listed right in the main product
> specifications there. This is officially the first inexpensive
> (relatively) SSD with a battery-backed write cache built into it. If
> Yeb's test results prove it works as it's supposed to under
> PostgreSQL, I'll be happy to finally have a moderately priced SSD I
> can recommend to people for database use. And I fear I'll be out of
> excuses to avoid buying one as a toy for my home system.

Hello list,

After a week testing I think I can answer the question above: does it work like it's supposed to under PostgreSQL?

YES

The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro, http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534

* it is safe to mount filesystems with barrier off, since it has a 'supercap backed cache'. That data is not lost is confirmed by a dozen power switch off tests while running either diskchecker.pl or pgbench.
* the above implies it's also safe to use this SSD with barriers, though that will perform less, since this drive obeys write-through commands.
* the highest pgbench tps number for the TPC-B test for a scale 300 database (~5GB) I could get was over 6700. Judging from the iostat average util of ~40% on the xlog partition, I believe that this number is limited by other factors than the SSD, like CPU, core count, core MHz, memory size/speed, 8.4 pgbench without threads. Unfortunately I don't have faster/more core machines available for testing right now.
* pgbench numbers for a larger than RAM database, read only, were over 25000 tps (details are at the end of this post), during which iostat reported ~18500 read iops and 100% utilization.
* pgbench max reported latencies are 20% of comparable BBWC setups.
* how reliable it is over time, and how it performs over time, I cannot say, since I tested it only for a week.

regards,
Yeb Havinga

PS: of course all claims I make here are without any warranty. All information in this mail is for reference purposes, I do not claim it is suitable for your database setup.

Some info on configuration:
BOOT_IMAGE=/boot/vmlinuz-2.6.32-22-server elevator=deadline
quad core AMD Phenom(tm) II X4 940 Processor on 3.0GHz
16GB RAM 667MHz DDR2

Disk/filesystem settings.
Model Family: OCZ Vertex SSD
Device Model: OCZ VERTEX2-PRO
Firmware Version: 1.10

hdparm: did not change standard settings: write cache is on, as well as readahead.
hdparm -AW /dev/sdc
/dev/sdc:
 look-ahead = 1 (on)
 write-caching = 1 (on)

Untuned ext4 filesystem.

Mount options:
/dev/sdc2 on /data type ext4 (rw,noatime,nodiratime,relatime,barrier=0,discard)
/dev/sdc3 on /xlog type ext4 (rw,noatime,nodiratime,relatime,barrier=0,discard)

Note the -o discard: this means use of the automatic SSD trimming on a new linux kernel. Also, per core per filesystem there now is an [ext4-dio-unwrit] process - which suggests something like 'directio'? I haven't investigated this any further.
Sysctl: (copied from a larger RAM database machine)
kernel.core_uses_pid = 1
fs.file-max = 327679
net.ipv4.ip_local_port_range = 1024 65000
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.sem = 250 32000 100 142
kernel.shmmni = 4096
kernel.sysrq = 1
kernel.shmmax = 33794121728
kernel.shmall = 16777216
net.core.rmem_default = 262144
net.core.rmem_max = 2097152
net.core.wmem_default = 262144
net.core.wmem_max = 262144
fs.aio-max-nr = 3145728
vm.swappiness = 0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Postgres settings:
8.4.4
--with-blocksize=4
I saw about 10% increase in performance compared to 8KB blocksizes.

Postgresql.conf: changed from the default config are:
maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 128 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 100

initdb with data on sda2 and xlog on sda3, C locale

Read/write test on ~5GB database:
$ pgbench -v -c 20 -M prepared -T 3600 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 20
duration: 3600 s
number of transactions actually processed: 24291875
tps = 6747.665859 (including connections establishing)
tps = 6747.721665 (excluding connections establishing)

Read only test on larger than RAM ~23GB database (server has 16GB physical RAM):
$ pgbench -c 20 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
*scaling factor: 1500*
query mode: prepared
number of clients: 20
duration: 300 s
number of transactions actually processed: 7556469
tps = 25184.056498 (including connections establishing)
tps = 25186.336911 (excluding connections establishing)

IOstat reports ~18500 reads/s and ~185 read MB/s during this read only test on the data partition, with 100% util.
6700 tps?! Wow...... Ok, I'm impressed. May wait a bit for prices to come down somewhat, but that sounds like two of those are going in one of my production machines (RAID 1, of course).

Yeb Havinga wrote:
> After a week testing I think I can answer the question above: does it
> work like it's supposed to under PostgreSQL?
>
> YES
>
> The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534
> [...]
> Mount options > /dev/sdc2 on /data type ext4 > (rw,noatime,nodiratime,relatime,barrier=0,discard) > /dev/sdc3 on /xlog type ext4 > (rw,noatime,nodiratime,relatime,barrier=0,discard) > Note the -o discard: this means use of the automatic SSD trimming on a > new linux kernel. > Also, per core per filesystem there now is a [ext4-dio-unwrit] process > - which suggest something like 'directio'? I haven't investigated this > any further. > > Sysctl: > (copied from a larger RAM database machine) > kernel.core_uses_pid = 1 > fs.file-max = 327679 > net.ipv4.ip_local_port_range = 1024 65000 > kernel.msgmni = 2878 > kernel.msgmax = 8192 > kernel.msgmnb = 65536 > kernel.sem = 250 32000 100 142 > kernel.shmmni = 4096 > kernel.sysrq = 1 > kernel.shmmax = 33794121728 > kernel.shmall = 16777216 > net.core.rmem_default = 262144 > net.core.rmem_max = 2097152 > net.core.wmem_default = 262144 > net.core.wmem_max = 262144 > fs.aio-max-nr = 3145728 > vm.swappiness = 0 > vm.dirty_background_ratio = 3 > vm.dirty_expire_centisecs = 500 > vm.dirty_writeback_centisecs = 100 > vm.dirty_ratio = 15 > > Postgres settings: > 8.4.4 > --with-blocksize=4 > I saw about 10% increase in performance compared to 8KB blocksizes. > > Postgresql.conf: > changed from default config are: > maintenance_work_mem = 480MB # pgtune wizard 2010-07-25 > checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25 > effective_cache_size = 5632MB # pgtune wizard 2010-07-25 > work_mem = 512MB # pgtune wizard 2010-07-25 > wal_buffers = 8MB # pgtune wizard 2010-07-25 > checkpoint_segments = 128 # pgtune said 16 here > shared_buffers = 1920MB # pgtune wizard 2010-07-25 > max_connections = 100 > > initdb with data on sda2 and xlog on sda3, C locale > > Read write test on ~5GB database: > $ pgbench -v -c 20 -M prepared -T 3600 test > starting vacuum...end. > starting vacuum pgbench_accounts...end. > transaction type: TPC-B (sort of) > scaling factor: 300 > query mode: prepared > number of clients: 20 > duration: 3600 s > number of transactions actually processed: 24291875 > tps = 6747.665859 (including connections establishing) > tps = 6747.721665 (excluding connections establishing) > > Read only test on larger than RAM ~23GB database (server has 16GB > fysical RAM) : > $ pgbench -c 20 -M prepared -T 300 -S test > starting vacuum...end. > transaction type: SELECT only > *scaling factor: 1500* > query mode: prepared > number of clients: 20 > duration: 300 s > number of transactions actually processed: 7556469 > tps = 25184.056498 (including connections establishing) > tps = 25186.336911 (excluding connections establishing) > > IOstat reports ~18500 reads/s and ~185 read MB/s during this read only > test on the data partition with 100% util. > >
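For reference, keeping those mount options across reboots would mean putting them in /etc/fstab. A minimal sketch based on the quoted mount output; the dump/pass fields at the end are just illustrative defaults:

/dev/sdc2  /data  ext4  noatime,barrier=0,discard  0  2
/dev/sdc3  /xlog  ext4  noatime,barrier=0,discard  0  2

Turning barriers off like this is only sane because of the supercap backed cache discussed above; on an ordinary drive the same options would trade away exactly the durability being tested here.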
On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga <yebhavinga@gmail.com> wrote: > After a week testing I think I can answer the question above: does it work > like it's supposed to under PostgreSQL? > > YES > > The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro, > http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534 > > * it is safe to mount filesystems with barrier off, since it has a 'supercap > backed cache'. That data is not lost is confirmed by a dozen power switch > off tests while running either diskchecker.pl or pgbench. > * the above implies its also safe to use this SSD with barriers, though that > will perform less, since this drive obeys write trough commands. > * the highest pgbench tps number for the TPC-B test for a scale 300 database > (~5GB) I could get was over 6700. Judging from the iostat average util of > ~40% on the xlog partition, I believe that this number is limited by other > factors than the SSD, like CPU, core count, core MHz, memory size/speed, 8.4 > pgbench without threads. Unfortunately I don't have a faster/more core > machines available for testing right now. > * pgbench numbers for a larger than RAM database, read only was over 25000 > tps (details are at the end of this post), during which iostat reported > ~18500 read iops and 100% utilization. > * pgbench max reported latencies are 20% of comparable BBWC setups. > * how reliable it is over time, and how it performs over time I cannot say, > since I tested it only for a week. Thank you very much for posting this analysis. This has IMNSHO the potential to be a game changer. There are still some unanswered questions in terms of how the drive wears, reliability, errors, and lifespan but 6700 tps off of a single 400$ device with decent fault tolerance is amazing (Intel, consider yourself upstaged). Ever since the first samsung SSD hit the market I've felt the days of the spinning disk have been numbered. Being able to build a 100k tps server on relatively inexpensive hardware without an entire rack full of drives is starting to look within reach. > Postgres settings: > 8.4.4 > --with-blocksize=4 > I saw about 10% increase in performance compared to 8KB blocksizes. That's very interesting -- we need more testing in that department... regards (and thanks again) merlin
Merlin Moncure wrote:
> On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>> Postgres settings:
>> 8.4.4
>> --with-blocksize=4
>> I saw about 10% increase in performance compared to 8KB blocksizes.
>
> That's very interesting -- we need more testing in that department...

Definitely - that 10% number was on the older hardware (the Core 2 E6600). After reading my post and the 185MBps with 18500 reads/s number I was a bit suspicious about whether I had done the tests on the new hardware with 4K, because 185MBps / 18500 reads/s is ~10KB per read, which is a lot closer to 8KB than 4KB. I checked with show block_size and it was 4K. Then I redid the tests on the new server with the default 8KB blocksize and got about 4700 tps (TPC-B/300)... 67/47 ≈ 1.43. So it seems that on newer hardware, the difference is larger than 10%.

regards,
Yeb Havinga
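For anyone wanting to reproduce the 4K vs 8K comparison: the block size is fixed at configure time, so each size needs its own build and a freshly initdb'd cluster. A rough sketch, with install and data paths invented purely for illustration:

$ ./configure --with-blocksize=4 --prefix=/opt/pg84-4k
$ make && make install
$ /opt/pg84-4k/bin/initdb -D /data/pg4k --locale=C
$ /opt/pg84-4k/bin/pg_ctl -D /data/pg4k -l /tmp/pg4k.log start
$ /opt/pg84-4k/bin/psql -d postgres -c "SHOW block_size;"

SHOW block_size should then report 4096; repeating the same steps without --with-blocksize gives the default 8192 for the other half of the comparison.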
> Definitely - that 10% number was on the older hardware (the Core 2 E6600). After reading my post and the 185MBps with 18500 reads/s number I was a bit suspicious about whether I had done the tests on the new hardware with 4K, because 185MBps / 18500 reads/s is ~10KB per read, which is a lot closer to 8KB than 4KB. I checked with show block_size and it was 4K. Then I redid the tests on the new server with the default 8KB blocksize and got about 4700 tps (TPC-B/300)... 67/47 ≈ 1.43. So it seems that on newer hardware, the difference is larger than 10%.

That doesn't make much sense unless there's some special advantage to a 4K blocksize with the hardware itself. Can you just do a basic filesystem test (like Bonnie++) with a 4K vs. 8K blocksize?

Also, are you running your pgbench tests more than once, just to account for randomizing?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
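If Bonnie++ turns out to be awkward for pinning the request size, fio can do the same comparison directly. A sketch along those lines, assuming fio with libaio support is installed; the file name, size and queue depth are arbitrary:

$ fio --name=rand4k --filename=/data/fio.tmp --size=5g --rw=randwrite \
      --bs=4k --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based
$ fio --name=rand8k --filename=/data/fio.tmp --size=5g --rw=randwrite \
      --bs=8k --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based

Comparing the IOPS and bandwidth reported for the two runs gives a filesystem-level answer to the 4K vs 8K question, independent of PostgreSQL.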
Josh Berkus wrote: > That doesn't make much sense unless there's some special advantage to a > 4K blocksize with the hardware itself. Given that pgbench is always doing tiny updates to blocks, I wouldn't be surprised if switching to smaller blocks helps it in a lot of situations if one went looking for them. Also, as you point out, pgbench runtime varies around wildly enough that 10% would need more investigation to really prove that means something. But I think Yeb has done plenty of investigation into the most interesting part here, the durability claims. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, Aug 2, 2010 at 6:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Josh Berkus wrote:
>> That doesn't make much sense unless there's some special advantage to a 4K blocksize with the hardware itself.
>
> Given that pgbench is always doing tiny updates to blocks, I wouldn't be surprised if switching to smaller blocks helps it in a lot of situations if one went looking for them. Also, as you point out, pgbench runtime varies around wildly enough that 10% would need more investigation to really prove that means something. But I think Yeb has done plenty of investigation into the most interesting part here, the durability claims.

Running the tests for longer helps a lot in reducing the noisy results. Also letting them run longer means that the background writer and autovacuum start getting involved, so the test becomes somewhat more realistic.
Scott Marlowe wrote:
> On Mon, Aug 2, 2010 at 6:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Josh Berkus wrote:
>>> That doesn't make much sense unless there's some special advantage to a 4K blocksize with the hardware itself.
>>
>> Given that pgbench is always doing tiny updates to blocks, I wouldn't be surprised if switching to smaller blocks helps it in a lot of situations if one went looking for them. Also, as you point out, pgbench runtime varies around wildly enough that 10% would need more investigation to really prove that means something. But I think Yeb has done plenty of investigation into the most interesting part here, the durability claims.

Please note that the 10% was on a slower CPU. On a more recent CPU the difference was 47%, based on tests that ran for an hour. That's why I absolutely agree with Merlin Moncure that more testing in this department is welcome, preferably by others since after all I could be on the payroll of OCZ :-)

I looked a bit into Bonnie++ but fail to see how I could do a test that somehow matches the PostgreSQL setup during the pgbench tests (db that fits in memory, so the test is actually how fast the ssd can capture sequential WAL writes and fsync without barriers, mixed with an occasional checkpoint with random write IO on another partition). Since the WAL writing is the same for both block_size setups, I decided to compare random writes to a file of 5GB with Oracle's Orion tool:

=== 4K test summary ====
ORION VERSION 11.1.0.7.0

Commandline:
-testname test -run oltp -size_small 4 -size_large 1024 -write 100

This maps to this test:
Test: test
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Large Columns:, 0
Total Data Points: 21

Name: /mnt/data/5gb Size: 5242880000
1 FILEs found.

Maximum Small IOPS=86883 @ Small=8 and Large=0
Minimum Small Latency=0.01 @ Small=1 and Large=0

=== 8K test summary ====
ORION VERSION 11.1.0.7.0

Commandline:
-testname test -run oltp -size_small 8 -size_large 1024 -write 100

This maps to this test:
Test: test
Small IO size: 8 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Large Columns:, 0
Total Data Points: 21

Name: /mnt/data/5gb Size: 5242880000
1 FILEs found.

Maximum Small IOPS=48798 @ Small=11 and Large=0
Minimum Small Latency=0.02 @ Small=1 and Large=0

> Running the tests for longer helps a lot in reducing the noisy results. Also letting them run longer means that the background writer and autovacuum start getting involved, so the test becomes somewhat more realistic.

Yes, that's why I did a lot of the TPC-B tests with -T 3600 so they'd run for an hour (also the 4K vs 8K blocksize in postgres).

regards,
Yeb Havinga
Hannu Krosing wrote: > Did it fit in shared_buffers, or system cache ? > Database was ~5GB, server has 16GB, shared buffers was set to 1920MB. > I first noticed this several years ago, when doing a COPY to a large > table with indexes took noticably longer (2-3 times longer) when the > indexes were in system cache than when they were in shared_buffers. > I read this as a hint: try increasing shared_buffers. I'll redo the pgbench run with increased shared_buffers. >> so the test is actually how fast the ssd can capture >> sequential WAL writes and fsync without barriers, mixed with an >> occasional checkpoint with random write IO on another partition). Since >> the WAL writing is the same for both block_size setups, I decided to >> compare random writes to a file of 5GB with Oracle's Orion tool: >> > > Are you sure that you are not writing full WAL pages ? > I'm not sure I understand this question. > Do you have any stats on how much WAL is written for 8kb and 4kb test > cases ? > Would some iostat -xk 1 for each partition suffice? > And for other disk i/o during the tests ? > Not existent. regards, Yeb Havinga
Yeb Havinga wrote:
> Small IO size: 4 KB
> Maximum Small IOPS=86883 @ Small=8 and Large=0
>
> Small IO size: 8 KB
> Maximum Small IOPS=48798 @ Small=11 and Large=0

Conclusion: you can write 4KB blocks almost twice as fast as 8KB ones. This is a useful observation about the effectiveness of the write cache on the unit, but not really a surprise. On ideal hardware the IOPS figure should double when you halve the write size. I already wagered the difference in the pgbench results is caused by the same math.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
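Working those two Orion results out as bandwidth makes the point concrete: 86883 IOPS x 4 KB is roughly 340 MB/s, and 48798 IOPS x 8 KB is roughly 381 MB/s. The drive moves about the same amount of data per second in both runs, so halving the request size lifts the IOPS figure by a factor of about 1.8 (86883 / 48798), which is presumably the same effect showing up in the pgbench numbers.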
Yeb Havinga wrote:
> Hannu Krosing wrote:
>> Did it fit in shared_buffers, or system cache ?
>
> Database was ~5GB, server has 16GB, shared buffers was set to 1920MB.
>
>> I first noticed this several years ago, when doing a COPY to a large table with indexes took noticably longer (2-3 times longer) when the indexes were in system cache than when they were in shared_buffers.
>
> I read this as a hint: try increasing shared_buffers. I'll redo the pgbench run with increased shared_buffers.

Shared buffers raised from 1920MB to 3520MB:

pgbench -v -l -c 20 -M prepared -T 1800 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 20
duration: 1800 s
number of transactions actually processed: 12971714
tps = 7206.244065 (including connections establishing)
tps = 7206.349947 (excluding connections establishing)

:-)
On Tue, Aug 3, 2010 at 11:37 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Yeb Havinga wrote:
>> Hannu Krosing wrote:
>>> Did it fit in shared_buffers, or system cache ?
>>
>> Database was ~5GB, server has 16GB, shared buffers was set to 1920MB.
>>
>>> I first noticed this several years ago, when doing a COPY to a large table with indexes took noticably longer (2-3 times longer) when the indexes were in system cache than when they were in shared_buffers.
>>
>> I read this as a hint: try increasing shared_buffers. I'll redo the pgbench run with increased shared_buffers.
>
> Shared buffers raised from 1920MB to 3520MB:
>
> pgbench -v -l -c 20 -M prepared -T 1800 test
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 20
> duration: 1800 s
> number of transactions actually processed: 12971714
> tps = 7206.244065 (including connections establishing)
> tps = 7206.349947 (excluding connections establishing)
>
> :-)

1) What can we compare this against (changing only the shared_buffers setting)?

2) I've heard that some SSDs have utilities that you can use to query the write cycles in order to estimate lifespan. Does this one have such a utility, and is it possible to publish the output (an approximation of the amount of work along with this would be wonderful)?

merlin
On Tue, 2010-08-03 at 10:40 +0200, Yeb Havinga wrote:
> Please note that the 10% was on a slower CPU. On a more recent CPU the difference was 47%, based on tests that ran for an hour.

I am not surprised at all that reading and writing almost twice as much data from/to disk takes 47% longer. If less time is spent on seeking, the amount of data starts playing a bigger role.

> That's why I absolutely agree with Merlin Moncure that more testing in this department is welcome, preferably by others since after all I could be on the payroll of OCZ :-)

:)

> I looked a bit into Bonnie++ but fail to see how I could do a test that somehow matches the PostgreSQL setup during the pgbench tests (db that fits in memory,

Did it fit in shared_buffers, or system cache?

Once we are in high tps ground, the time it takes to move pages between userspace and system cache starts to play a bigger role.

I first noticed this several years ago, when doing a COPY to a large table with indexes took noticeably longer (2-3 times longer) when the indexes were in system cache than when they were in shared_buffers.

> so the test is actually how fast the ssd can capture sequential WAL writes and fsync without barriers, mixed with an occasional checkpoint with random write IO on another partition). Since the WAL writing is the same for both block_size setups, I decided to compare random writes to a file of 5GB with Oracle's Orion tool:

Are you sure that you are not writing full WAL pages?

Do you have any stats on how much WAL is written for 8kb and 4kb test cases?

And for other disk i/o during the tests?

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability Services, Consulting and Training
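Answering the how-much-WAL question does not need extra instrumentation; on 8.4 one can snapshot the WAL insert location around a run and subtract. A rough sketch, reusing the database name from the earlier pgbench runs:

$ psql -Atc "SELECT pg_current_xlog_insert_location();" test    # note value before, e.g. 2A/F1234567
$ pgbench -v -c 20 -M prepared -T 1800 test
$ psql -Atc "SELECT pg_current_xlog_insert_location();" test    # note value after

The two values are hex pairs of the form high/low; the WAL volume is approximately (high2 - high1) * 4 GB plus the difference of the low words (pg_xlog_location_diff() only appeared in later releases, and for comparing two runs that approximation is plenty). Doing this once per block size would show how much of the 4K vs 8K gap is extra WAL, keeping in mind that with full_page_writes on, the first change to a page after each checkpoint logs a whole block, so the 4K build writes smaller page images.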
On Jul 26, 2010, at 12:45 PM, Greg Smith wrote:
> Yeb Havinga wrote:
>> I did some ext3, ext4, xfs, jfs and also ext2 tests on the just-in-memory read/write test (scale 300). No real winners or losers, though ext2 isn't really faster and the manual need for fix (y) during boot makes it impractical in its standard configuration.
>
> That's what happens every time I try it too. The theoretical benefits of ext2 for hosting PostgreSQL just don't translate into significant performance increases on database oriented tests, certainly not ones that would justify the downside of having fsck issues come back again. Glad to see that holds true on this hardware too.

ext2 is slow for many reasons. ext4 with no journal is significantly faster than ext2. ext4 with a journal is faster than ext2.

> --
> Greg Smith 2ndQuadrant US Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com www.2ndQuadrant.us
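For completeness, trying the no-journal ext4 variant mentioned above is just a mkfs option on reasonably recent e2fsprogs and kernels. A sketch, reusing the device name from the earlier posts; this of course destroys the filesystem contents:

$ mkfs.ext4 -O ^has_journal /dev/sdc2
$ tune2fs -l /dev/sdc2 | grep -i features    # has_journal should be absent from the feature list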
On Aug 2, 2010, at 7:26 AM, Merlin Moncure wrote:
> On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>> After a week testing I think I can answer the question above: does it work like it's supposed to under PostgreSQL?
>>
>> YES
>>
>> The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534
>>
>> * it is safe to mount filesystems with barrier off, since it has a 'supercap backed cache'. That data is not lost is confirmed by a dozen power switch off tests while running either diskchecker.pl or pgbench.
>> * the above implies its also safe to use this SSD with barriers, though that will perform less, since this drive obeys write trough commands.
>> * the highest pgbench tps number for the TPC-B test for a scale 300 database (~5GB) I could get was over 6700. Judging from the iostat average util of ~40% on the xlog partition, I believe that this number is limited by other factors than the SSD, like CPU, core count, core MHz, memory size/speed, 8.4 pgbench without threads. Unfortunately I don't have a faster/more core machines available for testing right now.
>> * pgbench numbers for a larger than RAM database, read only was over 25000 tps (details are at the end of this post), during which iostat reported ~18500 read iops and 100% utilization.
>> * pgbench max reported latencies are 20% of comparable BBWC setups.
>> * how reliable it is over time, and how it performs over time I cannot say, since I tested it only for a week.
>
> Thank you very much for posting this analysis. This has IMNSHO the potential to be a game changer. There are still some unanswered questions in terms of how the drive wears, reliability, errors, and lifespan but 6700 tps off of a single 400$ device with decent fault tolerance is amazing (Intel, consider yourself upstaged). Ever since the first samsung SSD hit the market I've felt the days of the spinning disk have been numbered. Being able to build a 100k tps server on relatively inexpensive hardware without an entire rack full of drives is starting to look within reach.

Intel's next gen 'enterprise' SSD's are due out later this year. I have heard from those with access to test samples that they really like them -- these people rejected the previous versions because of the data loss on power failure.

So, hopefully there will be some interesting competition later this year in the medium price range enterprise ssd market.

>> Postgres settings:
>> 8.4.4
>> --with-blocksize=4
>> I saw about 10% increase in performance compared to 8KB blocksizes.
>
> That's very interesting -- we need more testing in that department...
>
> regards (and thanks again)
> merlin
On Aug 3, 2010, at 9:27 AM, Merlin Moncure wrote:
> 2) I've heard that some SSD have utilities that you can use to query the write cycles in order to estimate lifespan. Does this one, and is it possible to publish the output (an approximation of the amount of work along with this would be wonderful)?

On the Intel drives, it's available via SMART. Plenty of hits on how to read the data from Google. Sandforce drives probably have it exposed via SMART as well.

I have had over 50 X25-M's (80GB G1's) in production for 22 months that write ~100GB a day and SMART reports they have 78% of their write cycles left. Plus, when it dies from usage it supposedly enters a read-only state. (These only have recoverable data, so data loss on power failure is not a concern for me.)

So if Sandforce has low write amplification like Intel (they claim to be better), longevity should be fine.
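For the archives, reading those counters is a one-liner with smartmontools; the device name here is illustrative and the exact attribute names vary by vendor and firmware:

$ smartctl -A /dev/sdc

On the Intel drives the interesting lines are the host-writes counter and the media wearout indicator; SandForce-based drives expose similar lifetime and wear attributes under different names, so it is worth eyeballing the whole attribute table once and re-checking it after a known amount of pgbench traffic to approximate the write volume per day.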
greg@2ndquadrant.com (Greg Smith) writes: > Yeb Havinga wrote: >> * What filesystem to use on the SSD? To minimize writes and maximize >> chance for seeing errors I'd choose ext2 here. > > I don't consider there to be any reason to deploy any part of a > PostgreSQL database on ext2. The potential for downtime if the fsck > doesn't happen automatically far outweighs the minimal performance > advantage you'll actually see in real applications. Ah, but if the goal is to try to torture the SSD as cruelly as possible, these aren't necessarily downsides (important or otherwise). I don't think ext2 helps much in "maximizing chances of seeing errors" in notably useful ways, as the extra "torture" that takes place as part of the post-remount fsck isn't notably PG-relevant. (It's not obvious that errors encountered would be readily mapped to issues relating to PostgreSQL.) I think the WAL-oriented test would be *way* more useful; inducing work whose "brokenness" can be measured in one series of files in one directory should be way easier than trying to find changes across a whole PG cluster. I don't expect the filesystem choice to be terribly significant to that. -- "cbbrowne","@","gmail.com" "Heuristics (from the French heure, "hour") limit the amount of time spent executing something. [When using heuristics] it shouldn't take longer than an hour to do something."
jd@commandprompt.com ("Joshua D. Drake") writes: > On Sat, 2010-07-24 at 16:21 -0400, Greg Smith wrote: >> Greg Smith wrote: >> > Note that not all of the Sandforce drives include a capacitor; I hope >> > you got one that does! I wasn't aware any of the SF drives with a >> > capacitor on them were even shipping yet, all of the ones I'd seen >> > were the chipset that doesn't include one still. Haven't checked in a >> > few weeks though. >> >> Answer my own question here: the drive Yeb got was the brand spanking >> new OCZ Vertex 2 Pro, selling for $649 at Newegg for example: >> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and with >> the supercacitor listed right in the main production specifications >> there. This is officially the first inexpensive (relatively) SSD with a >> battery-backed write cache built into it. If Yeb's test results prove >> it works as it's supposed to under PostgreSQL, I'll be happy to finally >> have a moderately priced SSD I can recommend to people for database >> use. And I fear I'll be out of excuses to avoid buying one as a toy for >> my home system. > > That is quite the toy. I can get 4 SATA-II with RAID Controller, with > battery backed cache, for the same price or less :P Sure, but it: - Fits into a single slot - Is quiet - Consumes little power - Generates little heat - Is likely to be about as quick as the 4-drive array It doesn't have the extra 4TB of storage, but if you're building big-ish databases, metrics have to change anyways. This is a pretty slick answer for the small OLTP server. -- output = reverse("moc.liamg" "@" "enworbbc") http://linuxfinances.info/info/postgresql.html Chaotic Evil means never having to say you're sorry.
On 10-08-04 03:49 PM, Scott Carey wrote:
> On Aug 2, 2010, at 7:26 AM, Merlin Moncure wrote:
>> On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>>> After a week testing I think I can answer the question above: does it work like it's supposed to under PostgreSQL?
>>>
>>> YES
>>>
>>> The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
>>> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534
>>>
>>> * it is safe to mount filesystems with barrier off, since it has a 'supercap backed cache'. That data is not lost is confirmed by a dozen power switch off tests while running either diskchecker.pl or pgbench.
>>> * the above implies its also safe to use this SSD with barriers, though that will perform less, since this drive obeys write trough commands.
>>> * the highest pgbench tps number for the TPC-B test for a scale 300 database (~5GB) I could get was over 6700. Judging from the iostat average util of ~40% on the xlog partition, I believe that this number is limited by other factors than the SSD, like CPU, core count, core MHz, memory size/speed, 8.4 pgbench without threads. Unfortunately I don't have a faster/more core machines available for testing right now.
>>> * pgbench numbers for a larger than RAM database, read only was over 25000 tps (details are at the end of this post), during which iostat reported ~18500 read iops and 100% utilization.
>>> * pgbench max reported latencies are 20% of comparable BBWC setups.
>>> * how reliable it is over time, and how it performs over time I cannot say, since I tested it only for a week.
>>
>> Thank you very much for posting this analysis. This has IMNSHO the potential to be a game changer. There are still some unanswered questions in terms of how the drive wears, reliability, errors, and lifespan but 6700 tps off of a single 400$ device with decent fault tolerance is amazing (Intel, consider yourself upstaged). Ever since the first samsung SSD hit the market I've felt the days of the spinning disk have been numbered. Being able to build a 100k tps server on relatively inexpensive hardware without an entire rack full of drives is starting to look within reach.
>
> Intel's next gen 'enterprise' SSD's are due out later this year. I have heard from those with access to test samples that they really like them -- these people rejected the previous versions because of the data loss on power failure.
>
> So, hopefully there will be some interesting competition later this year in the medium price range enterprise ssd market.

I'll be doing some testing on enterprise grade SSDs this year. I'll also be looking at some hybrid storage products that use SSDs as accelerators mixed with lower cost storage.

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
Greg Smith wrote: > > * How to test for power failure? > > I've had good results using one of the early programs used to > investigate this class of problems: > http://brad.livejournal.com/2116715.html?page=2 FYI, this tool is mentioned in the Postgres documentation: http://www.postgresql.org/docs/9.0/static/wal-reliability.html -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
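Since that tool keeps coming up in this thread, the basic diskchecker.pl procedure is roughly as follows; host names, paths and sizes are placeholders, and the exact arguments should be checked against the script's own usage output:

# on a second machine that will survive the power cut:
$ diskchecker.pl -l

# on the machine with the drive under test:
$ diskchecker.pl -s otherhost create /data/testfile 500
# pull the plug while that runs, power the box back up, then:
$ diskchecker.pl -s otherhost verify /data/testfile

A clean verify after a handful of power cycles is the same kind of evidence Yeb collected for the Vertex 2 Pro earlier in this thread.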