Thread: Testing Sandforce SSD

Testing Sandforce SSD

From
Yeb Havinga
Date:
Hello list,

Probably like many others I've wondered why no SSD manufacturer puts a
small BBU on a SSD drive. Triggered by Greg Smith's mail
http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php
here, and also anandtech's review at
http://www.anandtech.com/show/2899/1 (see page 6 for pictures of the
capacitor) I ordered a SandForce drive and this week it finally arrived.

And now I have to test it and was wondering about some things like

* How to test for power failure? I thought of running, on the same
machine, a parallel pgbench setup on two clusters where one runs with
data and wal on a rotating disk, the other on the SSD, both without BBU
controller. Then turn off power. Do that a few times. The problem in
this scenario is that even when the SSD would show no data loss and the
rotating disk would for a few times, a dozen tests without failure isn't
actually proof that the drive can write its complete buffer to disk
after power failure.

* How long should the power be turned off? A minute? 15 minutes?

* What filesystem to use on the SSD? To minimize writes and maximize the
chance of seeing errors I'd choose ext2 here. For the sake of not
comparing apples with pears I'd have to go with ext2 on the rotating
data disk as well.

Do you guys have any more ideas to properly 'feel this disk at its teeth' ?

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
David Boreham
Date:
> Do you guys have any more ideas to properly 'feel this disk at its
> teeth' ?

While an 'end-to-end' test using PG is fine, I think it would be easier
to determine if the drive is behaving correctly by using a simple test
program that emulates the storage semantics the WAL expects. Have it
write a constant stream of records, fsync'ing after each write. Record
the highest record number flushed so far in some place that won't be
lost with the drive under test (e.g. send it over the network to another
machine).

Kill the power, bring the system back up again and examine what's at the
tail end of that file. I think this will give you the worst case test
with the easiest result discrimination.

If you want to you could add concurrent random writes to another file
for extra realism.
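
A minimal sketch of that kind of harness, purely as an illustration (not an
existing program from this thread; the witness host/port, record size and
file path are made-up assumptions):

#!/usr/bin/env python
# Append fixed-size records, fsync after each write, and report the highest
# record number to a witness machine so the counter survives the power cut.
import os, socket, struct

WITNESS = ("witness.example.com", 9000)   # hypothetical witness host/port
RECORD_SIZE = 512

def run(path):
    sock = socket.create_connection(WITNESS)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    seq = 0
    try:
        while True:
            seq += 1
            # Each record carries its sequence number, padded to a fixed size.
            record = struct.pack(">Q", seq).ljust(RECORD_SIZE, b"\0")
            os.write(fd, record)
            os.fsync(fd)  # only claim durability after the fsync returns
            sock.sendall(("%d\n" % seq).encode())  # witness logs last durable seq
    finally:
        os.close(fd)
        sock.close()

if __name__ == "__main__":
    run("/mnt/ssd/test_file")

After power-up, compare the last sequence number the witness logged against
what is actually readable at the tail of the file; any acknowledged but
missing record means the flush wasn't durable.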

Someone here may already have a suitable test program. I know I've
written several over the years in order to test I/O performance, prove
the existence of kernel bugs, and so on.

I doubt it matters much how long the power is turned off. A second should
be plenty of time to flush pending writes if the drive is going to do so.



Re: Testing Sandforce SSD

From
david@lang.hm
Date:
On Sat, 24 Jul 2010, David Boreham wrote:

>> Do you guys have any more ideas to properly 'feel this disk at its teeth' ?
>
> While an 'end-to-end' test using PG is fine, I think it would be easier to
> determine if the drive is behaving correctly by using a simple test program
> that emulates the storage semantics the WAL expects. Have it write a constant
> stream of records, fsync'ing after each write. Record the highest record
> number flushed so far in some place that won't be lost with the drive under
> test (e.g. send it over the network to another machine).
>
> Kill the power, bring the system back up again and examine what's at the tail
> end of that file. I think this will give you the worst case test with the
> easiest result discrimination.
>
> If you want to you could add concurrent random writes to another file for
> extra realism.
>
> Someone here may already have a suitable test program. I know I've written
> several over the years in order to test I/O performance, prove the existence
> of kernel bugs, and so on.
>
> I doubt it matters much how long the power is turned off. A second should be
> plenty of time to flush pending writes if the drive is going to do so.

remember that SATA is designed to be hot-plugged, so you don't have to
kill the entire system to kill power to the drive.

this is a little more abrupt than the system losing power, but in terms
of losing data this is about the worst case (while at the same time, it
eliminates the possibility that the OS continues to perform writes to the
drive as power dies, which is a completely different class of problems,
independent of the drive type)

David Lang

Re: Testing Sandforce SSD

From
Ben Chobot
Date:
On Jul 24, 2010, at 12:20 AM, Yeb Havinga wrote:

> The problem in this scenario is that even when the SSD would show no data
> loss and the rotating disk would for a few times, a dozen tests without
> failure isn't actually proof that the drive can write its complete buffer
> to disk after power failure.

Yes, this is always going to be the case with testing like this - you'll never be able to prove that it will always be
safe. 

Re: Testing Sandforce SSD

From
Greg Smith
Date:
Yeb Havinga wrote:
> Probably like many other's I've wondered why no SSD manufacturer puts
> a small BBU on a SSD drive. Triggered by Greg Smith's mail
> http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php
> here, and also anandtech's review at
> http://www.anandtech.com/show/2899/1 (see page 6 for pictures of the
> capacitor) I ordered a SandForce drive and this week it finally arrived.

Note that not all of the Sandforce drives include a capacitor; I hope
you got one that does!  I wasn't aware any of the SF drives with a
capacitor on them were even shipping yet, all of the ones I'd seen were
the chipset that doesn't include one still.  Haven't checked in a few
weeks though.

> * How to test for power failure?

I've had good results using one of the early programs used to
investigate this class of problems:
http://brad.livejournal.com/2116715.html?page=2

You really need a second "witness" server to do this sort of thing
reliably, which that provides.

> * What filesystem to use on the SSD? To minimize writes and maximize
> chance for seeing errors I'd choose ext2 here.

I don't consider there to be any reason to deploy any part of a
PostgreSQL database on ext2.  The potential for downtime if the fsck
doesn't happen automatically far outweighs the minimal performance
advantage you'll actually see in real applications.  All of the
benchmarks showing large gains for ext2 over ext3 I have seen have been
synthetic, not real database performance; the internal ones I've run
using things like pgbench do not show a significant improvement.  (Yes,
I'm already working on finding time to publicly release those findings)

Put it on ext3, toggle on noatime, and move on to testing.  The overhead
of the metadata writes is the least of the problems when doing
write-heavy stuff on Linux.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Merlin Moncure
Date:
On Sat, Jul 24, 2010 at 3:20 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Hello list,
>
> Probably like many other's I've wondered why no SSD manufacturer puts a
> small BBU on a SSD drive. Triggered by Greg Smith's mail
> http://archives.postgresql.org/pgsql-performance/2010-02/msg00291.php here,
> and also anandtech's review at http://www.anandtech.com/show/2899/1 (see
> page 6 for pictures of the capacitor) I ordered a SandForce drive and this
> week it finally arrived.
>
> And now I have to test it and was wondering about some things like
>
> * How to test for power failure?

I test like this: write a small program that sends an endless series of
inserts like this:
*) on the server:
create table foo (id serial);
*) from the client:
insert into foo default values returning id;
on the client side print the inserted value to the terminal after the
query is reported as complete to the client.

Run the program, wait a bit, then pull the plug on the server.  The
database should recover clean and the last reported insert on the
client should be there when it restarts.  Try restarting immediately a
few times then if that works try it and let it simmer overnight.  If
it makes it at least 24-48 hours that's a very promising sign.
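
A rough sketch of that client loop, assuming Python with psycopg2 (the
connection settings are placeholders; this is an illustration, not Merlin's
actual program):

# Insert rows one at a time and print the returned id only after the commit
# is acknowledged, so the terminal shows the last transaction the server
# claimed was durable before the plug was pulled.
# Assumes "create table foo (id serial);" has already been run on the server.
import psycopg2

conn = psycopg2.connect(host="server.example.com", dbname="test")  # placeholder DSN
conn.autocommit = True   # each insert is its own transaction
cur = conn.cursor()

while True:
    cur.execute("insert into foo default values returning id;")
    print(cur.fetchone()[0])   # printed only after the server reported success

After the crash, compare select max(id) from foo on the recovered database
with the last value printed on the client.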

merlin

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Greg Smith wrote:
> Note that not all of the Sandforce drives include a capacitor; I hope
> you got one that does!  I wasn't aware any of the SF drives with a
> capacitor on them were even shipping yet, all of the ones I'd seen
> were the chipset that doesn't include one still.  Haven't checked in a
> few weeks though.
I think I did; it was expensive enough, though while ordering it's very
easy to order the wrong one, since all names on the product category page look
the same. (OCZ Vertex 2 Pro)
>> * How to test for power failure?
>
> I've had good results using one of the early programs used to
> investigate this class of problems:
> http://brad.livejournal.com/2116715.html?page=2
A great tool, thanks for the link!

  diskchecker: running 34 sec, 4.10% coverage of 500 MB (1342 writes; 39/s)
  diskchecker: running 35 sec, 4.24% coverage of 500 MB (1390 writes; 39/s)
  diskchecker: running 36 sec, 4.35% coverage of 500 MB (1427 writes; 39/s)
  diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s)
didn't get 'ok' from server (11387 316950), msg=[] = Connection reset by
peer at ./diskchecker.pl line 132.

here's where I removed the power and left it off for about a minute.
Then I started it again and did the verify:

yeb@a:~$ ./diskchecker.pl -s client45.eemnes verify test_file
 verifying: 0.00%
Total errors: 0

:-)
this was on ext2

>> * What filesystem to use on the SSD? To minimize writes and maximize
>> chance for seeing errors I'd choose ext2 here.
>
> I don't consider there to be any reason to deploy any part of a
> PostgreSQL database on ext2.  The potential for downtime if the fsck
> doesn't happen automatically far outweighs the minimal performance
> advantage you'll actually see in real applications.
Hmm... wouldn't that apply to other filesystems as well? I know that JFS
also won't mount if booted unclean; it somehow needs a marker from the
fsck. Don't know about ext3, xfs, etc.
> All of the benchmarks showing large gains for ext2 over ext3 I have
> seen been synthetic, not real database performance; the internal ones
> I've run using things like pgbench do not show a significant
> improvement.  (Yes, I'm already working on finding time to publicly
> release those findings)
The reason I'd choose ext2 on the SSD was mainly to decrease the number
of writes, not for performance. Maybe I should ultimately do tests for
both journalled and ext2 filesystems and compare the amount of data per
x pgbench transactions.
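
One crude way to make that comparison would be to snapshot the
sectors-written counter in Linux's /proc/diskstats before and after a fixed
pgbench run. A sketch (the device name is a placeholder, and the assumed
field layout is the standard one where the 10th field is sectors written,
512 bytes each):

# Compare write volume per filesystem by diffing the "sectors written"
# counter for one device around a fixed-size pgbench workload.
def sectors_written(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[9])   # sectors written since boot
    raise ValueError("device %s not found" % device)

before = sectors_written("sdb")   # placeholder: device holding the cluster
# ... run the fixed pgbench workload here ...
after = sectors_written("sdb")
print("MB written: %.1f" % ((after - before) * 512 / 1024.0 / 1024.0))
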
> Put it on ext3, toggle on noatime, and move on to testing.  The
> overhead of the metadata writes is the least of the problems when
> doing write-heavy stuff on Linux.
Will surely do and post the results.

thanks,
Yeb Havinga

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Yeb Havinga wrote:
> diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes; 39/s)
> Total errors: 0
>
> :-)
OTOH, I now notice the 39 writes/s... If that means ~ 39 tps... bummer.



Re: Testing Sandforce SSD

From
Greg Smith
Date:
Greg Smith wrote:
> Note that not all of the Sandforce drives include a capacitor; I hope
> you got one that does!  I wasn't aware any of the SF drives with a
> capacitor on them were even shipping yet, all of the ones I'd seen
> were the chipset that doesn't include one still.  Haven't checked in a
> few weeks though.

Answer my own question here:  the drive Yeb got was the brand spanking
new OCZ Vertex 2 Pro, selling for $649 at Newegg for example:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and with
the supercapacitor listed right in the main product specifications
there.  This is officially the first inexpensive (relatively) SSD with a
battery-backed write cache built into it.  If Yeb's test results prove
it works as it's supposed to under PostgreSQL, I'll be happy to finally
have a moderately priced SSD I can recommend to people for database
use.  And I fear I'll be out of excuses to avoid buying one as a toy for
my home system.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
"Joshua D. Drake"
Date:
On Sat, 2010-07-24 at 16:21 -0400, Greg Smith wrote:
> Greg Smith wrote:
> > Note that not all of the Sandforce drives include a capacitor; I hope
> > you got one that does!  I wasn't aware any of the SF drives with a
> > capacitor on them were even shipping yet, all of the ones I'd seen
> > were the chipset that doesn't include one still.  Haven't checked in a
> > few weeks though.
>
> Answer my own question here:  the drive Yeb got was the brand spanking
> new OCZ Vertex 2 Pro, selling for $649 at Newegg for example:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and with
> the supercacitor listed right in the main production specifications
> there.  This is officially the first inexpensive (relatively) SSD with a
> battery-backed write cache built into it.  If Yeb's test results prove
> it works as it's supposed to under PostgreSQL, I'll be happy to finally
> have a moderately priced SSD I can recommend to people for database
> use.  And I fear I'll be out of excuses to avoid buying one as a toy for
> my home system.

That is quite the toy. I can get 4 SATA-II with RAID Controller, with
battery backed cache, for the same price or less :P

Sincerely,

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt

Re: Testing Sandforce SSD

From
Greg Smith
Date:
Joshua D. Drake wrote:
> That is quite the toy. I can get 4 SATA-II with RAID Controller, with
> battery backed cache, for the same price or less :P
>

True, but if you look at tests like
http://www.anandtech.com/show/2899/12 it suggests there's probably at
least a 6:1 performance speedup for workloads with a lot of random I/O
to them.  And I'm really getting sick of the power/noise/heat that the 6
drives in my home server produce.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Yeb Havinga wrote:
> Yeb Havinga wrote:
>> diskchecker: running 37 sec, 4.47% coverage of 500 MB (1468 writes;
>> 39/s)
>> Total errors: 0
>>
>> :-)
> OTOH, I now notice the 39 write /s .. If that means ~ 39 tps... bummer.
When playing with it a bit more, I couldn't get the test_file to be
created in the right place on the test system. It turns out I had the
diskchecker config switched, and 39 writes/s was the speed of the
not-rebooted system, sorry.

I did several diskchecker.pl tests this time with the testfile on the
SSD; none of the tests returned an error :-)

Writes/s start low but quickly converge to a number in the range of 1200
to 1800. The writes diskchecker does are 16kB writes. Making them 4kB
writes does not increase writes/s. 32kB seems a little less, 64kB is
about two thirds of the initial writes/s and 128kB is half.

So no BBU speeds here for writes, but still a ~factor-10 improvement in
iops over a rotating SATA disk.

regards,
Yeb Havinga

PS: hdparm showed write cache was on. I did tests with both ext2 and
xfs; the xfs tests I did with both barrier and nobarrier.


Re: Testing Sandforce SSD

From
Greg Smith
Date:
Yeb Havinga wrote:
> Writes/s start low but quickly converge to a number in the range of
> 1200 to 1800. The writes diskchecker does are 16kB writes. Making this
> 4kB writes does not increase writes/s. 32kB seems a little less, 64kB
> is about two third of initial writes/s and 128kB is half.

Let's turn that into MB/s numbers:

4k * 1200 = 4.7 MB/s
8k * 1200 = 9.4 MB/s
16k * 1200 = 18.75 MB/s
64k * 1200 * 2/3 [800] = 50 MB/s
128k * 1200 / 2 [600] = 75 MB/s

For comparison's sake, a 7200 RPM drive running PostgreSQL will do <120
commits/second without a BBWC, so at an 8K block size that's <1 MB/s.
If you put a cache in the middle, I'm used to seeing about 5000 8K
commits/second, which is around 40 MB/s.  So this is sitting right in
the middle of those two.  Sequential writes with a commit after each one
like this are basically the worst case for the SSD, so if it can provide
reasonable performance on that I'd be happy.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Greg Smith wrote:
> Put it on ext3, toggle on noatime, and move on to testing.  The
> overhead of the metadata writes is the least of the problems when
> doing write-heavy stuff on Linux.
I ran a pgbench run and a power failure test during pgbench with a
3-year-old computer:

8GB DDR ?
Intel Core 2 duo 6600 @ 2.40GHz
Intel Corporation 82801IB (ICH9) 2 port SATA IDE Controller
64 bit 2.6.31-22-server (Ubuntu karmic), kernel option elevator=deadline
sysctl options besides increasing shm:
fs.file-max=327679
fs.aio-max-nr=3145728
vm.swappiness=0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Filesystem on SSD with postgresql data: ext3 mounted with
noatime,nodiratime,relatime
Postgresql cluster: did initdb with C locale. Data and pg_xlog together
on the same ext3 filesystem.

Changed in postgresql.conf: settings with pgtune for OLTP and 15 connections
maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 31 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 15 # pgtune wizard 2010-07-25

Initialized with scale 800, which is about 12GB. I especially went beyond
the in-RAM size for this machine (that would be ~ 5GB), so random reads
would weigh in the result. Then I let pgbench run the TPC-B benchmark with
-M prepared, 10 clients and -T 3600 (one hour); after that I loaded the
logfile in a db and did some queries. Then I realized the pgbench result
page was not in the screen buffer anymore so I cannot copy it here, but
hey, those can be edited as well right ;-)

select count(*),count(*)/3600,avg(time),stddev(time) from log;
  count  | ?column? |          avg          |     stddev
---------+----------+-----------------------+----------------
 4939212 |     1372 | 7282.8581978258880161 | 11253.96967962
(1 row)

Judging from the latencies in the logfiles I did not experience serious
lagging (time is in microseconds):

select * from log order by time desc limit 3;
 client_id | tx_no |  time   | file_no |   epoch    | time_us
-----------+-------+---------+---------+------------+---------
         3 | 33100 | 1229503 |       0 | 1280060345 |  866650
         9 | 39990 | 1077519 |       0 | 1280060345 |  858702
         2 | 55323 | 1071060 |       0 | 1280060519 |  750861
(3 rows)

select * from log order by time desc limit 3 OFFSET 1000;
 client_id | tx_no  |  time  | file_no |   epoch    | time_us
-----------+--------+--------+---------+------------+---------
         5 | 262466 | 245953 |       0 | 1280062074 |  513789
         1 | 267519 | 245867 |       0 | 1280062074 |  513301
         7 | 273662 | 245532 |       0 | 1280062078 |  378932
(3 rows)

select * from log order by time desc limit 3 OFFSET 10000;
 client_id | tx_no  | time  | file_no |   epoch    | time_us
-----------+--------+-------+---------+------------+---------
         5 | 123011 | 82854 |       0 | 1280061036 |  743986
         6 | 348967 | 82853 |       0 | 1280062687 |  776317
         8 | 439789 | 82848 |       0 | 1280063109 |  552928
(3 rows)

Then I started pgbench again with the same setting, let it run for a few
minutes and in another console did CHECKPOINT and then turned off power.
After restarting, the database recovered without a problem.

LOG:  database system was interrupted; last known up at 2010-07-25
10:14:15 EDT
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at F/98008610
LOG:  record with zero length at F/A2BAC040
LOG:  redo done at F/A2BAC010
LOG:  last completed transaction was at log time 2010-07-25
10:14:16.151037-04

regards,
Yeb Havinga

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Yeb Havinga wrote:
>
> 8GB DDR2 something..
(lots of details removed)

Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at
http://tinypic.com/r/x5e846/3

Thanks http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm for
the gnuplot and psql scripts!


Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Yeb Havinga wrote:
> Greg Smith wrote:
>> Put it on ext3, toggle on noatime, and move on to testing.  The
>> overhead of the metadata writes is the least of the problems when
>> doing write-heavy stuff on Linux.
> I ran a pgbench run and power failure test during pgbench with a 3
> year old computer
>
On the same config more tests.

scale 10 read only and read/write tests. note: only 240 s.

starting vacuum...end.
transaction type: SELECT only
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 8208115
tps = 34197.109896 (including connections establishing)
tps = 34200.658720 (excluding connections establishing)

yeb@client45:~$ pgbench -c 10 -l -M prepared -T 240 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: prepared
number of clients: 10
duration: 240 s
number of transactions actually processed: 809271
tps = 3371.147020 (including connections establishing)
tps = 3371.518611 (excluding connections establishing)

----------
scale 300 (just fits in RAM) read only and read/write tests

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 9219279
tps = 30726.931095 (including connections establishing)
tps = 30729.692823 (excluding connections establishing)

The test above doesn't really test the drive but shows the CPU/RAM limit.

pgbench -c 10 -l -M prepared -T 3600 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 8838200
tps = 2454.994217 (including connections establishing)
tps = 2455.012480 (excluding connections establishing)

------
scale 2000

pgbench -c 10 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 300 s
number of transactions actually processed: 755772
tps = 2518.547576 (including connections establishing)
tps = 2518.762476 (excluding connections establishing)

So the test above tests the random seek performance. Iostat on the drive
showed a steady rate of just over 4000 read io/s:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.39    0.00   13.37   60.40    0.00   14.85
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00 4171.00    0.00 60624.00     0.00    29.07    11.81    2.83   0.24 100.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

pgbench -c 10 -l -M prepared -T 24000 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 2000
query mode: prepared
number of clients: 10
duration: 24000 s
number of transactions actually processed: 30815691
tps = 1283.979098 (including connections establishing)
tps = 1283.980446 (excluding connections establishing)

Note the duration of several hours. No long waits occurred; for this
last test the latency png is at http://yfrog.com/f/0vlatencywp/ and the
TPS graph at http://yfrog.com/f/b5tpsp/

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
Matthew Wakeling
Date:
On Sun, 25 Jul 2010, Yeb Havinga wrote:
> Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at
> http://tinypic.com/r/x5e846/3

Does your latency graph really have milliseconds as the y axis? If so,
this device is really slow - some requests have a latency of more than a
second!

Matthew

--
 The early bird gets the worm, but the second mouse gets the cheese.

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Matthew Wakeling wrote:
> On Sun, 25 Jul 2010, Yeb Havinga wrote:
>> Graph of TPS at http://tinypic.com/r/b96aup/3 and latency at
>> http://tinypic.com/r/x5e846/3
>
> Does your latency graph really have milliseconds as the y axis?
Yes
> If so, this device is really slow - some requests have a latency of
> more than a second!
I try to just give the facts. Please remember that those particular graphs
are from a read/write pgbench run on a bigger-than-RAM database that ran for
some time (so with checkpoints), on a *single* $435 50GB drive without a
BBU raid controller. Also, this is a picture with a few million points:
the ones above 200ms are perhaps a hundred and hence make up a very
small fraction.

So far I'm pretty impressed with this drive. Let's be fair to OCZ and the
SandForce guys and not shoot from the hip with things like "really slow"
without that being backed by a graphed pgbench run together with its
cost, so we can compare numbers with numbers.

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
Greg Smith
Date:
Matthew Wakeling wrote:
> Does your latency graph really have milliseconds as the y axis? If so,
> this device is really slow - some requests have a latency of more than
> a second!

Have you tried that yourself?  If you generate one of those with
standard hard drives and a BBWC under Linux, I expect you'll discover
those latencies to be >5 seconds long.  I recently saw >100 *seconds*
running a large pgbench test due to latency flushing things to disk, on
a system with 72GB of RAM.  Takes a long time to flush >3GB of random
I/O out to disk when the kernel will happily cache that many writes
until checkpoint time.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Greg Smith
Date:
Yeb Havinga wrote:
> Please remember that particular graphs are from a read/write pgbench
> run on a bigger than RAM database that ran for some time (so with
> checkpoints), on a *single* $435 50GB drive without BBU raid controller.

To get similar *average* performance results you'd need to put about 4
drives and a BBU into a server.  The worst-case latency on that solution
is pretty bad though, when a lot of random writes are queued up; I
suspect that's where the SSD will look much better.

By the way:  if you want to run a lot more tests in an organized
fashion, that's what http://github.com/gregs1104/pgbench-tools was
written to do.  That will spit out graphs by client and by scale showing
how sensitive the test results are to each.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Matthew Wakeling
Date:
On Mon, 26 Jul 2010, Greg Smith wrote:
> Matthew Wakeling wrote:
>> Does your latency graph really have milliseconds as the y axis? If so, this
>> device is really slow - some requests have a latency of more than a second!
>
> Have you tried that yourself?  If you generate one of those with standard
> hard drives and a BBWC under Linux, I expect you'll discover those latencies
> to be >5 seconds long.  I recently saw >100 *seconds* running a large pgbench
> test due to latency flushing things to disk, on a system with 72GB of RAM.
> Takes a long time to flush >3GB of random I/O out to disk when the kernel
> will happily cache that many writes until checkpoint time.

Apologies, I was interpreting the graph as the latency of the device, not
all the layers in-between as well. There isn't any indication in the email
with the graph as to what the test conditions or software are. Obviously
if you factor in checkpoints and the OS writing out everything, then you
would have to expect some large latency operations. However, if the device
itself behaved as in the graph, I would be most unhappy and send it back.

Yeb also made the point - there are far too many points on that graph to
really tell what the average latency is. It'd be instructive to have a few
figures, like "only x% of requests took longer than y".

Matthew

--
 I wouldn't be so paranoid if you weren't all out to get me!!

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Matthew Wakeling wrote:
> Apologies, I was interpreting the graph as the latency of the device,
> not all the layers in-between as well. There isn't any indication in
> the email with the graph as to what the test conditions or software are.
That info was in the email preceding the graph mail, but I see now I
forgot to mention it was an 8.4.4 postgres version.

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
Greg Spiegelberg
Date:
On Mon, Jul 26, 2010 at 10:26 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Matthew Wakeling wrote:
>> Apologies, I was interpreting the graph as the latency of the device, not
>> all the layers in-between as well. There isn't any indication in the email
>> with the graph as to what the test conditions or software are.
> That info was in the email preceding the graph mail, but I see now I forgot
> to mention it was a 8.4.4 postgres version.


Speaking of the layers in-between, has this test been done with the ext3 journal on a different device?  Maybe the purpose is wrong for the SSD.  Use the SSD for the ext3 journal and the spindled drives for filesystem?  Another possibility is to use ext2 on the SSD.

Greg

Re: Testing Sandforce SSD

From
Greg Smith
Date:
Matthew Wakeling wrote:
> Yeb also made the point - there are far too many points on that graph
> to really tell what the average latency is. It'd be instructive to
> have a few figures, like "only x% of requests took longer than y".

Average latency is the inverse of TPS.  So if the result is, say, 1200
TPS, that means the average latency is 1 / (1200 transactions/second) =
0.83 milliseconds/transaction.  The average TPS figure is normally on a
more useful scale as far as being able to compare them in ways that make
sense to people.

pgbench-tools derives average, worst-case, and 90th percentile figures
for latency from the logs.  I have 37MB worth of graphs from a system
showing how all this typically works for regular hard drives I've been
given permission to publish; just need to find a place to host it at
internally and I'll make the whole stack available to the world.  So far
Yeb's data is showing that a single SSD is competitive with a small
array on average, but with better worst-case behavior than I'm used to
seeing.
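
For anyone who wants such percentage figures without the full toolchain,
here is a small sketch that computes average, 90th percentile and worst-case
latency from a pgbench -l transaction log (per the log table shown earlier
in the thread, the third column is the per-transaction time in microseconds;
the log file name is a placeholder):

# Derive average, 90th percentile and maximum latency (in ms) from pgbench -l
# output, whose third column is the transaction time in microseconds.
def latency_stats(path):
    times = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                times.append(int(parts[2]))
    times.sort()
    n = len(times)
    avg = sum(times) / float(n)
    return avg / 1000.0, times[int(n * 0.9)] / 1000.0, times[-1] / 1000.0

avg_ms, p90_ms, max_ms = latency_stats("pgbench_log.12345")  # placeholder name
print("avg %.2f ms, 90th percentile %.2f ms, max %.2f ms" % (avg_ms, p90_ms, max_ms))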

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Greg Smith
Date:
Greg Spiegelberg wrote:
> Speaking of the layers in-between, has this test been done with the
> ext3 journal on a different device?  Maybe the purpose is wrong for
> the SSD.  Use the SSD for the ext3 journal and the spindled drives for
> filesystem?

The main disk bottleneck on PostgreSQL databases is the random seeks
for reading and writing to the main data blocks.  The journal
information is practically noise in comparison--it barely matters
because it's so much less difficult to keep up with.  This is why I
don't really find ext2 interesting either.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
"Kevin Grittner"
Date:
Greg Smith <greg@2ndquadrant.com> wrote:

> Yeb's data is showing that a single SSD is competitive with a
> small array on average, but with better worst-case behavior than
> I'm used to seeing.

So, how long before someone benchmarks a small array of SSDs?  :-)

-Kevin

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Greg Smith wrote:
> Yeb Havinga wrote:
>> Please remember that particular graphs are from a read/write pgbench
>> run on a bigger than RAM database that ran for some time (so with
>> checkpoints), on a *single* $435 50GB drive without BBU raid controller.
>
> To get similar *average* performance results you'd need to put about 4
> drives and a BBU into a server.  The worst-case latency on that
> solution is pretty bad though, when a lot of random writes are queued
> up; I suspect that's where the SSD will look much better.
>
> By the way:  if you want to run a lot more tests in an organized
> fashion, that's what http://github.com/gregs1104/pgbench-tools was
> written to do.  That will spit out graphs by client and by scale
> showing how sensitive the test results are to each.
Got it, running the default config right now.

When you say 'comparable to a small array' - could you give a ballpark
figure for 'small'?

regards,
Yeb Havinga

PS: Some update on the testing: I did some ext3, ext4, xfs, jfs and also
ext2 tests on the just-in-memory read/write test (scale 300). No real
winners or losers, though ext2 isn't really faster and the manual need
for fsck's fix (y) during boot makes it impractical in its standard
configuration. I did some poweroff tests with barriers explicitly off
in ext3, ext4 and xfs; still all recoveries went ok.


Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Yeb Havinga wrote:
>> To get similar *average* performance results you'd need to put about
>> 4 drives and a BBU into a server.  The
>
Please forget this question, I now see it in the mail i'm replying to.
Sorry for the spam!

-- Yeb


Re: Testing Sandforce SSD

From
Greg Smith
Date:
Yeb Havinga wrote:
> I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory
> read/write test. (scale 300) No real winners or losers, though ext2
> isn't really faster and the manual need for fix (y) during boot makes
> it impractical in its standard configuration.

That's what happens every time I try it too.  The theoretical benefits
of ext2 for hosting PostgreSQL just don't translate into significant
performance increases on database oriented tests, certainly not ones
that would justify the downside of having fsck issues come back again.
Glad to see that holds true on this hardware too.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Scott Marlowe
Date:
On Mon, Jul 26, 2010 at 12:40 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Greg Spiegelberg wrote:
>>
>> Speaking of the layers in-between, has this test been done with the ext3
>> journal on a different device?  Maybe the purpose is wrong for the SSD.  Use
>> the SSD for the ext3 journal and the spindled drives for filesystem?
>
> The main disk bottleneck on PostgreSQL databases are the random seeks for
> reading and writing to the main data blocks.  The journal information is
> practically noise in comparison--it barely matters because it's so much less
> difficult to keep up with.  This is why I don't really find ext2 interesting
> either.

Note that SSDs aren't usually real fast at large sequential writes
though, so it might be worth putting pg_xlog on a spinning pair in a
mirror and seeing how much, if any, the SSD drive speeds up when not
having to do pg_xlog.

Re: Testing Sandforce SSD

From
Greg Spiegelberg
Date:
On Mon, Jul 26, 2010 at 1:45 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Yeb Havinga wrote:
>> I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory
>> read/write test. (scale 300) No real winners or losers, though ext2 isn't
>> really faster and the manual need for fix (y) during boot makes it
>> impractical in its standard configuration.
>
> That's what happens every time I try it too.  The theoretical benefits of
> ext2 for hosting PostgreSQL just don't translate into significant
> performance increases on database oriented tests, certainly not ones that
> would justify the downside of having fsck issues come back again.  Glad to
> see that holds true on this hardware too.


I know I'm talking development now but is there a case for a pg_xlog block device to remove the file system overhead and guaranteeing your data is written sequentially every time?

Greg

Re: Testing Sandforce SSD

From
Andres Freund
Date:
On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
> On Mon, Jul 26, 2010 at 1:45 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> > Yeb Havinga wrote:
> >> I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory
> >> read/write test. (scale 300) No real winners or losers, though ext2 isn't
> >> really faster and the manual need for fix (y) during boot makes it
> >> impractical in its standard configuration.
> >>
> >
> > That's what happens every time I try it too.  The theoretical benefits of
> > ext2 for hosting PostgreSQL just don't translate into significant
> > performance increases on database oriented tests, certainly not ones that
> > would justify the downside of having fsck issues come back again.  Glad to
> > see that holds true on this hardware too.
> I know I'm talking development now but is there a case for a pg_xlog block
> device to remove the file system overhead and guaranteeing your data is
> written sequentially every time?
For one, I doubt that it's a relevant enough efficiency loss in
comparison with a significantly more complex implementation
(for one you can't grow/shrink, for another you have to do more
complex, hw-dependent things like rounding to hardware boundaries,
page size etc. to stay efficient); for another, my experience is that at
a relatively low point XLogInsert gets to be the bottleneck - so I
don't see much point in improving at that low level (yet at least).

Where I would like to do some hw-dependent measuring (because I see
significant improvements there) would be prefetching for seqscan,
indexscans et al. using blktrace... But I currently don't have the
time. And it's another topic ;-)

Andres

Re: Testing Sandforce SSD

From
Greg Smith
Date:
Greg Spiegelberg wrote:
> I know I'm talking development now but is there a case for a pg_xlog
> block device to remove the file system overhead and guaranteeing your
> data is written sequentially every time?

It's possible to set the PostgreSQL wal_sync_method parameter in the
database to open_datasync or open_sync, and if you have an operating
system that supports direct writes it will use those and bypass things
like the OS write cache.  That's close to what you're suggesting,
supposedly portable, and it does show some significant benefit when it's
properly supported.  Problem has been, the synchronous writing code on
Linux in particular hasn't ever worked right against ext3, and the
PostgreSQL code doesn't make the right call at all on Solaris.  So
there's two popular platforms that it just plain doesn't work on, even
though it should.
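
For reference, trying that is a one-line change in postgresql.conf (whether
the direct-write path actually helps depends on the platform and filesystem
caveats above):

wal_sync_method = open_datasync   # or open_sync; on Linux the default is normally fdatasync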

We've gotten reports that there are bleeding edge Linux kernel and
library versions available now that finally fix that issue, and that
PostgreSQL automatically takes advantage of them when it's compiled on
one of them.  But I'm not aware of any distribution that makes this easy
to try out that's available yet, paint is still wet on the code I think.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Hannu Krosing
Date:
On Mon, 2010-07-26 at 14:34 -0400, Greg Smith wrote:
> Matthew Wakeling wrote:
> > Yeb also made the point - there are far too many points on that graph
> > to really tell what the average latency is. It'd be instructive to
> > have a few figures, like "only x% of requests took longer than y".
>
> Average latency is the inverse of TPS.  So if the result is, say, 1200
> TPS, that means the average latency is 1 / (1200 transactions/second) =
> 0.83 milliseconds/transaction.

This is probably only true if you run all transactions sequentially in
one connection?

If you run 10 parallel threads and get 1200 tps, the average transaction
time (latency?) is probably closer to 8.3 ms?

>  The average TPS figure is normally on a
> more useful scale as far as being able to compare them in ways that make
> sense to people.
>
> pgbench-tools derives average, worst-case, and 90th percentile figures
> for latency from the logs.  I have 37MB worth of graphs from a system
> showing how all this typically works for regular hard drives I've been
> given permission to publish; just need to find a place to host it at
> internally and I'll make the whole stack available to the world.  So far
> Yeb's data is showing that a single SSD is competitive with a small
> array on average, but with better worst-case behavior than I'm used to
> seeing.
>
> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com   www.2ndQuadrant.us
>
>


--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



Re: Testing Sandforce SSD

From
Michael Stone
Date:
On Mon, Jul 26, 2010 at 01:47:14PM -0600, Scott Marlowe wrote:
>Note that SSDs aren't usually real fast at large sequential writes
>though, so it might be worth putting pg_xlog on a spinning pair in a
>mirror and seeing how much, if any, the SSD drive speeds up when not
>having to do pg_xlog.

xlog is also where I use ext2; it does bench faster for me in that
config, and the fsck issues don't really exist because you're not in a
situation with a lot of files being created/removed.

Mike Stone

Re: Testing Sandforce SSD

From
Michael Stone
Date:
On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
>I know I'm talking development now but is there a case for a pg_xlog block
>device to remove the file system overhead and guaranteeing your data is
>written sequentially every time?

If you dedicate a partition to xlog, you already get that in practice
with no extra development.

Mike Stone

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Michael Stone wrote:
> On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
>> I know I'm talking development now but is there a case for a pg_xlog
>> block
>> device to remove the file system overhead and guaranteeing your data is
>> written sequentially every time?
>
> If you dedicate a partition to xlog, you already get that in practice
> with no extra devlopment.
Due to the LBA remapping of the SSD, I'm not sure if putting files that
are sequentially written in a different partition (separate from e.g.
tables) would make a difference: in the end the SSD will have a set of new
blocks in its buffer and somehow arrange them into sets of 128KB or
256KB writes for the flash chips. See also
http://www.anandtech.com/show/2899/2

But I ran out of ideas to test, so I'm going to test it anyway.

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Yeb Havinga wrote:
> Michael Stone wrote:
>> On Mon, Jul 26, 2010 at 03:23:20PM -0600, Greg Spiegelberg wrote:
>>> I know I'm talking development now but is there a case for a pg_xlog
>>> block
>>> device to remove the file system overhead and guaranteeing your data is
>>> written sequentially every time?
>>
>> If you dedicate a partition to xlog, you already get that in practice
>> with no extra devlopment.
> Due to the LBA remapping of the SSD, I'm not sure of putting files
> that are sequentially written in a different partition (together with
> e.g. tables) would make a difference: in the end the SSD will have a
> set new blocks in it's buffer and somehow arrange them into sets of
> 128KB of 256KB writes for the flash chips. See also
> http://www.anandtech.com/show/2899/2
>
> But I ran out of ideas to test, so I'm going to test it anyway.
Same machine config as mentioned before, with data and xlog on separate
partitions, ext3 with barrier off (safe on this SSD).

pgbench -c 10 -M prepared -T 3600 -l test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 10
duration: 3600 s
number of transactions actually processed: 10856359
tps = 3015.560252 (including connections establishing)
tps = 3015.575739 (excluding connections establishing)

This is about 25% faster than data and xlog combined on the same filesystem.

Below is output from iostat -xk 1 -p /dev/sda, which shows per-partition
statistics for each second.
sda2 is data, sda3 is xlog. In the third second a checkpoint seems to start.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          63.50    0.00   30.50    2.50    0.00    3.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  6518.00   36.00 2211.00   148.00 35524.00    31.75     0.28    0.12   0.11  25.00
sda1              0.00     2.00    0.00    5.00     0.00   636.00   254.40     0.03    6.00   2.00   1.00
sda2              0.00   218.00   36.00   40.00   148.00  1032.00    31.05     0.00    0.00   0.00   0.00
sda3              0.00  6298.00    0.00 2166.00     0.00 33856.00    31.26     0.25    0.12   0.12  25.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          60.50    0.00   37.50    0.50    0.00    1.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  6514.00   33.00 2283.00   140.00 35188.00    30.51     0.32    0.14   0.13  29.00
sda1              0.00     0.00    0.00    3.00     0.00    12.00     8.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00   33.00    2.00   140.00     8.00     8.46     0.03    0.86   0.29   1.00
sda3              0.00  6514.00    0.00 2278.00     0.00 35168.00    30.88     0.29    0.13   0.13  29.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          33.00    0.00   34.00   18.00    0.00   15.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  3782.00    7.00 7235.00    28.00 44068.00    12.18    69.52    9.46   0.09  62.00
sda1              0.00     0.00    0.00    1.00     0.00     4.00     8.00     0.00    0.00   0.00   0.00
sda2              0.00   322.00    7.00 6018.00    28.00 25360.00     8.43    69.22   11.33   0.08  47.00
sda3              0.00  3460.00    0.00 1222.00     0.00 18728.00    30.65     0.30    0.25   0.25  30.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.00    0.00   36.00   22.50    0.00   32.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  1079.00    3.00 11110.00    12.00 49060.00     8.83   120.64   10.95   0.08  86.00
sda1              0.00     2.00    0.00    2.00     0.00   320.00   320.00     0.12   60.00  35.00   7.00
sda2              0.00    30.00    3.00 10739.00    12.00 43076.00     8.02   120.49   11.30   0.08  83.00
sda3              0.00  1047.00    0.00  363.00     0.00  5640.00    31.07     0.03    0.08   0.08   3.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          62.00    0.00   31.00    2.00    0.00    5.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  6267.00   51.00 2493.00   208.00 35040.00    27.71     1.80    0.71   0.12  31.00
sda1              0.00     0.00    0.00    3.00     0.00    12.00     8.00     0.00    0.00   0.00   0.00
sda2              0.00   123.00   51.00  344.00   208.00  1868.00    10.51     1.50    3.80   0.10   4.00
sda3              0.00  6144.00    0.00 2146.00     0.00 33160.00    30.90     0.30    0.14   0.14  30.00


Re: Testing Sandforce SSD

From
Greg Spiegelberg
Date:
On Wed, Jul 28, 2010 at 9:18 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Yeb Havinga wrote:
>> Due to the LBA remapping of the SSD, I'm not sure of putting files that are
>> sequentially written in a different partition (together with e.g. tables)
>> would make a difference: in the end the SSD will have a set new blocks in
>> it's buffer and somehow arrange them into sets of 128KB of 256KB writes for
>> the flash chips. See also http://www.anandtech.com/show/2899/2
>>
>> But I ran out of ideas to test, so I'm going to test it anyway.
>
> Same machine config as mentioned before, with data and xlog on separate
> partitions, ext3 with barrier off (save on this SSD).
>
> pgbench -c 10 -M prepared -T 3600 -l test
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 10
> duration: 3600 s
> number of transactions actually processed: 10856359
> tps = 3015.560252 (including connections establishing)
> tps = 3015.575739 (excluding connections establishing)
>
> This is about 25% faster than data and xlog combined on the same filesystem.


The trick may be in kjournald, of which there is one for each journalled ext3
file system.  I learned back in Red Hat 4 pre-U4 kernels that there was a
problem with kjournald that would either cause 30-second hangs or lock up my
server completely when pg_xlog and data were on the same file system plus a
few other "right" things going on.

Given the multicore world we have today, I think it makes sense that multiple
ext3 file systems, and the kjournald's that service them, are faster than a
single combined file system.


Greg

Re: Testing Sandforce SSD

From
Michael Stone
Date:
On Wed, Jul 28, 2010 at 03:45:23PM +0200, Yeb Havinga wrote:
>Due to the LBA remapping of the SSD, I'm not sure of putting files
>that are sequentially written in a different partition (together with
>e.g. tables) would make a difference: in the end the SSD will have a
>set new blocks in it's buffer and somehow arrange them into sets of
>128KB of 256KB writes for the flash chips. See also
>http://www.anandtech.com/show/2899/2

It's not a question of the hardware side, it's the software. The xlog
needs to be synchronized, and the things the filesystem has to do to
make that happen penalize the non-xlog disk activity. That's why my
preferred config is xlog on ext2, rest on xfs. That allows the
synchronous activity to happen with minimal overhead, while the parts
that benefit from having more data in flight can do that freely.

Mike Stone

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Greg Smith wrote:
> Greg Smith wrote:
>> Note that not all of the Sandforce drives include a capacitor; I hope
>> you got one that does!  I wasn't aware any of the SF drives with a
>> capacitor on them were even shipping yet, all of the ones I'd seen
>> were the chipset that doesn't include one still.  Haven't checked in
>> a few weeks though.
>
> Answer my own question here:  the drive Yeb got was the brand spanking
> new OCZ Vertex 2 Pro, selling for $649 at Newegg for example:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and
> with the supercacitor listed right in the main production
> specifications there.  This is officially the first inexpensive
> (relatively) SSD with a battery-backed write cache built into it.  If
> Yeb's test results prove it works as it's supposed to under
> PostgreSQL, I'll be happy to finally have a moderately priced SSD I
> can recommend to people for database use.  And I fear I'll be out of
> excuses to avoid buying one as a toy for my home system.
>
Hello list,

After a week of testing I think I can answer the question above: does it
work like it's supposed to under PostgreSQL?

YES

The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534

* it is safe to mount filesystems with barrier off, since it has a
'supercap backed cache'. That data is not lost was confirmed by a dozen
power-switch-off tests while running either diskchecker.pl or pgbench.
* the above implies it's also safe to use this SSD with barriers, though
that will perform less well, since this drive obeys write-through commands.
* the highest pgbench tps number for the TPC-B test for a scale 300
database (~5GB) I could get was over 6700. Judging from the iostat
average util of ~40% on the xlog partition, I believe that this number
is limited by factors other than the SSD, like CPU, core count, core
MHz, memory size/speed, and 8.4 pgbench without threads. Unfortunately I
don't have a faster machine with more cores available for testing right now.
* pgbench numbers for a larger-than-RAM database, read only, were over
25000 tps (details are at the end of this post), during which iostat
reported ~18500 read iops and 100% utilization.
* pgbench max reported latencies are 20% of those of comparable BBWC setups.
* how reliable it is over time, and how it performs over time, I cannot
say, since I tested it only for a week.

regards,
Yeb Havinga

PS: of course all claims I make here are without any warranty. All
information in this mail is for reference purposes; I do not claim it is
suitable for your database setup.

Some info on configuration:
BOOT_IMAGE=/boot/vmlinuz-2.6.32-22-server  elevator=deadline
quad core AMD Phenom(tm) II X4 940 Processor on 3.0GHz
16GB RAM 667MHz DDR2

Disk/ filesystem settings.
Model Family:     OCZ Vertex SSD
Device Model:     OCZ VERTEX2-PRO
Firmware Version: 1.10

hdparm: did not change standard settings: write cache is on, as well as
readahead.
 hdparm -AW /dev/sdc
/dev/sdc:
 look-ahead    =  1 (on)
 write-caching =  1 (on)

Untuned ext4 filesystem.
Mount options
/dev/sdc2 on /data type ext4
(rw,noatime,nodiratime,relatime,barrier=0,discard)
/dev/sdc3 on /xlog type ext4
(rw,noatime,nodiratime,relatime,barrier=0,discard)
Note the -o discard: this enables the automatic SSD trimming on a
new Linux kernel.
Also, per core per filesystem there now is an [ext4-dio-unwrit] process,
which suggests something like 'directio'? I haven't investigated this any
further.

Sysctl:
(copied from a larger RAM database machine)
kernel.core_uses_pid = 1
fs.file-max = 327679
net.ipv4.ip_local_port_range = 1024 65000
kernel.msgmni = 2878
kernel.msgmax = 8192
kernel.msgmnb = 65536
kernel.sem = 250 32000 100 142
kernel.shmmni = 4096
kernel.sysrq = 1
kernel.shmmax = 33794121728
kernel.shmall = 16777216
net.core.rmem_default = 262144
net.core.rmem_max = 2097152
net.core.wmem_default = 262144
net.core.wmem_max = 262144
fs.aio-max-nr = 3145728
vm.swappiness = 0
vm.dirty_background_ratio = 3
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15

Postgres settings:
8.4.4
--with-blocksize=4
I saw about a 10% increase in performance compared to the 8KB blocksize.

Postgresql.conf:
changed from default config are:
maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
effective_cache_size = 5632MB # pgtune wizard 2010-07-25
work_mem = 512MB # pgtune wizard 2010-07-25
wal_buffers = 8MB # pgtune wizard 2010-07-25
checkpoint_segments = 128 # pgtune said 16 here
shared_buffers = 1920MB # pgtune wizard 2010-07-25
max_connections = 100

initdb with data on sda2 and xlog on sda3, C locale

Read write test on ~5GB database:
$ pgbench -v -c 20 -M prepared -T 3600 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 20
duration: 3600 s
number of transactions actually processed: 24291875
tps = 6747.665859 (including connections establishing)
tps = 6747.721665 (excluding connections establishing)

Read-only test on a larger-than-RAM ~23GB database (the server has
16GB of physical RAM):
$ pgbench -c 20 -M prepared -T 300 -S test
starting vacuum...end.
transaction type: SELECT only
*scaling factor: 1500*
query mode: prepared
number of clients: 20
duration: 300 s
number of transactions actually processed: 7556469
tps = 25184.056498 (including connections establishing)
tps = 25186.336911 (excluding connections establishing)

iostat reports ~18500 reads/s and ~185 MB/s read during this read-only
test on the data partition, at 100% utilization.
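(These figures come from watching something like 'iostat -xk 1' during
the run; the r/s, rkB/s and %util columns for the data partition are
the relevant ones.)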


Re: Testing Sandforce SSD

From
Karl Denninger
Date:
6700 tps?!  Wow......

Ok, I'm impressed.  I may wait a bit for prices to come down somewhat,
but it sounds like two of those are going into one of my production
machines (RAID 1, of course).

Yeb Havinga wrote:
> Greg Smith wrote:
>> Greg Smith wrote:
>>> Note that not all of the Sandforce drives include a capacitor; I
>>> hope you got one that does!  I wasn't aware any of the SF drives
>>> with a capacitor on them were even shipping yet, all of the ones I'd
>>> seen were the chipset that doesn't include one still.  Haven't
>>> checked in a few weeks though.
>>
>> Answer my own question here:  the drive Yeb got was the brand
>> spanking new OCZ Vertex 2 Pro, selling for $649 at Newegg for
>> example:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and
>> with the supercacitor listed right in the main production
>> specifications there.  This is officially the first inexpensive
>> (relatively) SSD with a battery-backed write cache built into it.  If
>> Yeb's test results prove it works as it's supposed to under
>> PostgreSQL, I'll be happy to finally have a moderately priced SSD I
>> can recommend to people for database use.  And I fear I'll be out of
>> excuses to avoid buying one as a toy for my home system.
>>
> Hello list,
>
> After a week testing I think I can answer the question above: does it
> work like it's supposed to under PostgreSQL?
>
> YES
>
> The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534
>
> * it is safe to mount filesystems with barrier off, since it has a
> 'supercap backed cache'. That data is not lost is confirmed by a dozen
> power switch off tests while running either diskchecker.pl or pgbench.
> * the above implies its also safe to use this SSD with barriers,
> though that will perform less, since this drive obeys write trough
> commands.
> * the highest pgbench tps number for the TPC-B test for a scale 300
> database (~5GB) I could get was over 6700. Judging from the iostat
> average util of ~40% on the xlog partition, I believe that this number
> is limited by other factors than the SSD, like CPU, core count, core
> MHz, memory size/speed, 8.4 pgbench without threads. Unfortunately I
> don't have a faster/more core machines available for testing right now.
> * pgbench numbers for a larger than RAM database, read only was over
> 25000 tps (details are at the end of this post), during which iostat
> reported ~18500 read iops and 100% utilization.
> * pgbench max reported latencies are 20% of comparable BBWC setups.
> * how reliable it is over time, and how it performs over time I cannot
> say, since I tested it only for a week.
>
> regards,
> Yeb Havinga
>
> PS: ofcourse all claims I make here are without any warranty. All
> information in this mail is for reference purposes, I do not claim it
> is suitable for your database setup.
>
> Some info on configuration:
> BOOT_IMAGE=/boot/vmlinuz-2.6.32-22-server  elevator=deadline
> quad core AMD Phenom(tm) II X4 940 Processor on 3.0GHz
> 16GB RAM 667MHz DDR2
>
> Disk/ filesystem settings.
> Model Family:     OCZ Vertex SSD
> Device Model:     OCZ VERTEX2-PRO
> Firmware Version: 1.10
>
> hdparm: did not change standard settings: write cache is on, as well
> as readahead.
> hdparm -AW /dev/sdc
> /dev/sdc:
> look-ahead    =  1 (on)
> write-caching =  1 (on)
>
> Untuned ext4 filesystem.
> Mount options
> /dev/sdc2 on /data type ext4
> (rw,noatime,nodiratime,relatime,barrier=0,discard)
> /dev/sdc3 on /xlog type ext4
> (rw,noatime,nodiratime,relatime,barrier=0,discard)
> Note the -o discard: this means use of the automatic SSD trimming on a
> new linux kernel.
> Also, per core per filesystem there now is a [ext4-dio-unwrit] process
> - which suggest something like 'directio'? I haven't investigated this
> any further.
>
> Sysctl:
> (copied from a larger RAM database machine)
> kernel.core_uses_pid = 1
> fs.file-max = 327679
> net.ipv4.ip_local_port_range = 1024 65000
> kernel.msgmni = 2878
> kernel.msgmax = 8192
> kernel.msgmnb = 65536
> kernel.sem = 250 32000 100 142
> kernel.shmmni = 4096
> kernel.sysrq = 1
> kernel.shmmax = 33794121728
> kernel.shmall = 16777216
> net.core.rmem_default = 262144
> net.core.rmem_max = 2097152
> net.core.wmem_default = 262144
> net.core.wmem_max = 262144
> fs.aio-max-nr = 3145728
> vm.swappiness = 0
> vm.dirty_background_ratio = 3
> vm.dirty_expire_centisecs = 500
> vm.dirty_writeback_centisecs = 100
> vm.dirty_ratio = 15
>
> Postgres settings:
> 8.4.4
> --with-blocksize=4
> I saw about 10% increase in performance compared to 8KB blocksizes.
>
> Postgresql.conf:
> changed from default config are:
> maintenance_work_mem = 480MB # pgtune wizard 2010-07-25
> checkpoint_completion_target = 0.9 # pgtune wizard 2010-07-25
> effective_cache_size = 5632MB # pgtune wizard 2010-07-25
> work_mem = 512MB # pgtune wizard 2010-07-25
> wal_buffers = 8MB # pgtune wizard 2010-07-25
> checkpoint_segments = 128 # pgtune said 16 here
> shared_buffers = 1920MB # pgtune wizard 2010-07-25
> max_connections = 100
>
> initdb with data on sda2 and xlog on sda3, C locale
>
> Read write test on ~5GB database:
> $ pgbench -v -c 20 -M prepared -T 3600 test
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 20
> duration: 3600 s
> number of transactions actually processed: 24291875
> tps = 6747.665859 (including connections establishing)
> tps = 6747.721665 (excluding connections establishing)
>
> Read only test on larger than RAM ~23GB database (server has 16GB
> fysical RAM) :
> $ pgbench -c 20 -M prepared -T 300 -S test
> starting vacuum...end.
> transaction type: SELECT only
> *scaling factor: 1500*
> query mode: prepared
> number of clients: 20
> duration: 300 s
> number of transactions actually processed: 7556469
> tps = 25184.056498 (including connections establishing)
> tps = 25186.336911 (excluding connections establishing)
>
> IOstat reports ~18500 reads/s and ~185 read MB/s during this read only
> test on the data partition with 100% util.
>
>


Re: Testing Sandforce SSD

From
Merlin Moncure
Date:
On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> After a week testing I think I can answer the question above: does it work
> like it's supposed to under PostgreSQL?
>
> YES
>
> The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534
>
> * it is safe to mount filesystems with barrier off, since it has a 'supercap
> backed cache'. That data is not lost is confirmed by a dozen power switch
> off tests while running either diskchecker.pl or pgbench.
> * the above implies its also safe to use this SSD with barriers, though that
> will perform less, since this drive obeys write trough commands.
> * the highest pgbench tps number for the TPC-B test for a scale 300 database
> (~5GB) I could get was over 6700. Judging from the iostat average util of
> ~40% on the xlog partition, I believe that this number is limited by other
> factors than the SSD, like CPU, core count, core MHz, memory size/speed, 8.4
> pgbench without threads. Unfortunately I don't have a faster/more core
> machines available for testing right now.
> * pgbench numbers for a larger than RAM database, read only was over 25000
> tps (details are at the end of this post), during which iostat reported
> ~18500 read iops and 100% utilization.
> * pgbench max reported latencies are 20% of comparable BBWC setups.
> * how reliable it is over time, and how it performs over time I cannot say,
> since I tested it only for a week.

Thank you very much for posting this analysis.  This has IMNSHO the
potential to be a game changer.  There are still some unanswered
questions in terms of how the drive wears, reliability, errors, and
lifespan, but 6700 tps off of a single $400 device with decent fault
tolerance is amazing (Intel, consider yourself upstaged).  Ever since
the first Samsung SSD hit the market I've felt the days of the
spinning disk have been numbered.  Being able to build a 100k tps
server on relatively inexpensive hardware, without an entire rack full
of drives, is starting to look within reach.

> Postgres settings:
> 8.4.4
> --with-blocksize=4
> I saw about 10% increase in performance compared to 8KB blocksizes.

That's very interesting -- we need more testing in that department...

regards (and thanks again)
merlin

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Merlin Moncure wrote:
> On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>
>> Postgres settings:
>> 8.4.4
>> --with-blocksize=4
>> I saw about 10% increase in performance compared to 8KB blocksizes.
>>
>
> That's very interesting -- we need more testing in that department...
>
Definitely - that 10% number was on the older hardware (the Core 2
E6600). After reading my post and the 185MBps at 18500 reads/s figure,
I was a bit suspicious about whether I had run the tests on the new
hardware with 4K, because 185MBps / 18500 reads/s is ~10KB per read,
which seemed a lot closer to 8KB than to 4KB. I checked with show
block_size and it was 4K. Then I redid the tests on the new server with
the default 8KB block size and got about 4700 tps (TPC-B/300)... 67/47
= 1.47. So it seems that on newer hardware the difference is larger
than 10%.

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
Josh Berkus
Date:
> Definately - that 10% number was on the old-first hardware (the core 2
> E6600). After reading my post and the 185MBps with 18500 reads/s number
> I was a bit suspicious whether I did the tests on the new hardware with
> 4K, because 185MBps / 18500 reads/s is ~10KB / read, so I thought thats
> a lot closer to 8KB than 4KB. I checked with show block_size and it was
> 4K. Then I redid the tests on the new server with the default 8KB
> blocksize and got about 4700 tps (TPC-B/300)... 67/47 = 1.47. So it
> seems that on newer hardware, the difference is larger than 10%.

That doesn't make much sense unless there's some special advantage to a
4K blocksize with the hardware itself.  Can you just do a basic
filesystem test (like Bonnie++) with a 4K vs. 8K blocksize?

Also, are you running your pgbench tests more than once, just to
account for run-to-run variation?

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Testing Sandforce SSD

From
Greg Smith
Date:
Josh Berkus wrote:
> That doesn't make much sense unless there's some special advantage to a
> 4K blocksize with the hardware itself.

Given that pgbench is always doing tiny updates to blocks, I wouldn't be
surprised if switching to smaller blocks helps it in a lot of situations
if one went looking for them.  Also, as you point out, pgbench results
vary wildly enough between runs that a 10% difference would need more
investigation to really prove it means something.  But I think Yeb has
done plenty of investigation into the most interesting part here, the
durability claims.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Scott Marlowe
Date:
On Mon, Aug 2, 2010 at 6:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Josh Berkus wrote:
>>
>> That doesn't make much sense unless there's some special advantage to a
>> 4K blocksize with the hardware itself.
>
> Given that pgbench is always doing tiny updates to blocks, I wouldn't be
> surprised if switching to smaller blocks helps it in a lot of situations if
> one went looking for them.  Also, as you point out, pgbench runtime varies
> around wildly enough that 10% would need more investigation to really prove
> that means something.  But I think Yeb has done plenty of investigation into
> the most interesting part here, the durability claims.

Running the tests for longer helps a lot in reducing the noise in the
results.  Also, letting them run longer means that the background
writer and autovacuum start getting involved, so the test becomes
somewhat more realistic.

Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Scott Marlowe wrote:
> On Mon, Aug 2, 2010 at 6:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>
>> Josh Berkus wrote:
>>
>>> That doesn't make much sense unless there's some special advantage to a
>>> 4K blocksize with the hardware itself.
>>>
>> Given that pgbench is always doing tiny updates to blocks, I wouldn't be
>> surprised if switching to smaller blocks helps it in a lot of situations if
>> one went looking for them.  Also, as you point out, pgbench runtime varies
>> around wildly enough that 10% would need more investigation to really prove
>> that means something.  But I think Yeb has done plenty of investigation into
>> the most interesting part here, the durability claims.
>>
Please note that the 10% was on a slower CPU. On a more recent CPU the
difference was 47%, based on tests that ran for an hour. That's why I
absolutely agree with Merlin Moncure that more testing in this
department is welcome, preferably by others, since after all I could be
on the payroll of OCZ :-)

I looked a bit into Bonnie++ but failed to see how I could do a test
that somehow matches the PostgreSQL setup during the pgbench tests (a
database that fits in memory, so the test is really about how fast the
SSD can capture sequential WAL writes and fsyncs without barriers,
mixed with an occasional checkpoint doing random write IO on another
partition). Since the WAL writing is the same for both block_size
setups, I decided to compare random writes to a 5GB file with Oracle's
Orion tool:

=== 4K test summary ====
ORION VERSION 11.1.0.7.0

Commandline:
-testname test -run oltp -size_small 4 -size_large 1024 -write 100

This maps to this test:
Test: test
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:,      1,      2,      3,      4,      5,      6,
7,      8,      9,     10,     11,     12,     13,     14,     15,
16,     17,     18,     19,     20
Large Columns:,      0
Total Data Points: 21

Name: /mnt/data/5gb     Size: 5242880000
1 FILEs found.

Maximum Small IOPS=86883 @ Small=8 and Large=0
Minimum Small Latency=0.01 @ Small=1 and Large=0

=== 8K test summary ====

ORION VERSION 11.1.0.7.0

Commandline:
-testname test -run oltp -size_small 8 -size_large 1024 -write 100

This maps to this test:
Test: test
Small IO size: 8 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 100%
Cache Size: Not Entered
Duration for each Data Point: 60 seconds
Small Columns:,      1,      2,      3,      4,      5,      6,
7,      8,      9,     10,     11,     12,     13,     14,     15,
16,     17,     18,     19,     20
Large Columns:,      0
Total Data Points: 21

Name: /mnt/data/5gb     Size: 5242880000
1 FILEs found.

Maximum Small IOPS=48798 @ Small=11 and Large=0
Minimum Small Latency=0.02 @ Small=1 and Large=0
> Running the tests for longer helps a lot on reducing the noisy
> results.  Also letting them runs longer means that the background
> writer and autovacuum start getting involved, so the test becomes
> somewhat more realistic.
>
Yes, that's why I did a lot of the TPC-B tests with -T 3600, so they'd
run for an hour (also for the 4K vs 8K block size comparison in
Postgres).

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Hannu Krosing wrote:
> Did it fit in shared_buffers, or system cache ?
>
Database was ~5GB, server has 16GB, shared buffers was set to 1920MB.
> I first noticed this several years ago, when doing a COPY to a large
> table with indexes took noticably longer (2-3 times longer) when the
> indexes were in system cache than when they were in shared_buffers.
>
I read this as a hint: try increasing shared_buffers. I'll redo the
pgbench run with increased shared_buffers.
>> so the test is actually how fast the ssd can capture
>> sequential WAL writes and fsync without barriers, mixed with an
>> occasional checkpoint with random write IO on another partition). Since
>> the WAL writing is the same for both block_size setups, I decided to
>> compare random writes to a file of 5GB with Oracle's Orion tool:
>>
>
> Are you sure that you are not writing full WAL pages ?
>
I'm not sure I understand this question.
> Do you have any stats on how much WAL is written for 8kb and 4kb test
> cases ?
>
Would some iostat -xk 1 for each partition suffice?
> And for other disk i/o during the tests ?
>
Non-existent.

regards,
Yeb Havinga


Re: Testing Sandforce SSD

From
Greg Smith
Date:
Yeb Havinga wrote:
> Small IO size: 4 KB
> Maximum Small IOPS=86883 @ Small=8 and Large=0
>
> Small IO size: 8 KB
> Maximum Small IOPS=48798 @ Small=11 and Large=0

Conclusion:  you can write 4KB blocks almost twice as fast as 8KB ones.
This is a useful observation about the effectiveness of the write cache
on the unit, but not really a surprise.  On ideal hardware performance
should double if you halve the write size.  I already wagered the
difference in pgbench results is caused by the same math.
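(Doing the arithmetic on the Orion numbers above: 86883 x 4KB is
roughly 347 MB/s of small writes, while 48798 x 8KB is roughly 390
MB/s, so the total write bandwidth is in the same ballpark either way
and the IOPS scale close to inversely with the block size.)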

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Yeb Havinga
Date:
Yeb Havinga wrote:
> Hannu Krosing wrote:
>> Did it fit in shared_buffers, or system cache ?
>>
> Database was ~5GB, server has 16GB, shared buffers was set to 1920MB.
>> I first noticed this several years ago, when doing a COPY to a large
>> table with indexes took noticably longer (2-3 times longer) when the
>> indexes were in system cache than when they were in shared_buffers.
>>
> I read this as a hint: try increasing shared_buffers. I'll redo the
> pgbench run with increased shared_buffers.
Shared buffers raised from 1920MB to 3520MB:

 pgbench -v -l -c 20 -M prepared -T 1800 test
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: prepared
number of clients: 20
duration: 1800 s
number of transactions actually processed: 12971714
tps = 7206.244065 (including connections establishing)
tps = 7206.349947 (excluding connections establishing)

:-)

Re: Testing Sandforce SSD

From
Merlin Moncure
Date:
On Tue, Aug 3, 2010 at 11:37 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Yeb Havinga wrote:
>>
>> Hannu Krosing wrote:
>>>
>>> Did it fit in shared_buffers, or system cache ?
>>>
>>
>> Database was ~5GB, server has 16GB, shared buffers was set to 1920MB.
>>>
>>> I first noticed this several years ago, when doing a COPY to a large
>>> table with indexes took noticably longer (2-3 times longer) when the
>>> indexes were in system cache than when they were in shared_buffers.
>>>
>>
>> I read this as a hint: try increasing shared_buffers. I'll redo the
>> pgbench run with increased shared_buffers.
>
> Shared buffers raised from 1920MB to 3520MB:
>
> pgbench -v -l -c 20 -M prepared -T 1800 test
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 20
> duration: 1800 s
> number of transactions actually processed: 12971714
> tps = 7206.244065 (including connections establishing)
> tps = 7206.349947 (excluding connections establishing)
>
> :-)

1) what can we compare this against (changing only the
shared_buffers setting)?

2) I've heard that some SSDs have utilities you can use to query the
write cycles in order to estimate lifespan.  Does this one, and if so,
is it possible to publish the output (an approximation of the amount of
work done along with it would be wonderful)?

merlin

Re: Testing Sandforce SSD

From
Hannu Krosing
Date:
On Tue, 2010-08-03 at 10:40 +0200, Yeb Havinga wrote:
> se note that the 10% was on a slower CPU. On a more recent CPU the
> difference was 47%, based on tests that ran for an hour.

I am not surprised at all that reading and writing almost twice as much
data from/to disk takes 47% longer. When less time is spent on seeking,
the amount of data transferred starts to play a bigger role.

>  That's why I
> absolutely agree with Merlin Moncure that more testing in this
> department is welcome, preferably by others since after all I could be
> on the pay roll of OCZ :-)

:)


> I looked a bit into Bonnie++ but fail to see how I could do a test that
> somehow matches the PostgreSQL setup during the pgbench tests (db that
> fits in memory,

Did it fit in shared_buffers, or system cache ?

Once we are in high-tps territory, the time it takes to move pages
between userspace and the system cache starts to play a bigger role.

I first noticed this several years ago, when doing a COPY to a large
table with indexes took noticeably longer (2-3 times longer) when the
indexes were in the system cache than when they were in shared_buffers.

> so the test is actually how fast the ssd can capture
> sequential WAL writes and fsync without barriers, mixed with an
> occasional checkpoint with random write IO on another partition). Since
> the WAL writing is the same for both block_size setups, I decided to
> compare random writes to a file of 5GB with Oracle's Orion tool:

Are you sure that you are not writing full WAL pages ?

Do you have any stats on how much WAL is written for 8kb and 4kb test
cases ?
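One way to get that figure, as a sketch: sample
pg_current_xlog_location() before and after each run and take the
difference between the two reported positions, e.g.

 psql test -c 'SELECT pg_current_xlog_location();'  # before the run
 psql test -c 'SELECT pg_current_xlog_location();'  # again afterwards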

And for other disk i/o during the tests ?



--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training



Re: Testing Sandforce SSD

From
Scott Carey
Date:
On Jul 26, 2010, at 12:45 PM, Greg Smith wrote:

> Yeb Havinga wrote:
>> I did some ext3,ext4,xfs,jfs and also ext2 tests on the just-in-memory
>> read/write test. (scale 300) No real winners or losers, though ext2
>> isn't really faster and the manual need for fix (y) during boot makes
>> it impractical in its standard configuration.
>
> That's what happens every time I try it too.  The theoretical benefits
> of ext2 for hosting PostgreSQL just don't translate into significant
> performance increases on database oriented tests, certainly not ones
> that would justify the downside of having fsck issues come back again.
> Glad to see that holds true on this hardware too.
>

ext2 is slow for many reasons.  ext4 with no journal is significantly faster than ext2.  ext4 with a journal is faster
than ext2.

> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Testing Sandforce SSD

From
Scott Carey
Date:
On Aug 2, 2010, at 7:26 AM, Merlin Moncure wrote:

> On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
>> After a week testing I think I can answer the question above: does it work
>> like it's supposed to under PostgreSQL?
>>
>> YES
>>
>> The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534
>>
>> * it is safe to mount filesystems with barrier off, since it has a 'supercap
>> backed cache'. That data is not lost is confirmed by a dozen power switch
>> off tests while running either diskchecker.pl or pgbench.
>> * the above implies its also safe to use this SSD with barriers, though that
>> will perform less, since this drive obeys write trough commands.
>> * the highest pgbench tps number for the TPC-B test for a scale 300 database
>> (~5GB) I could get was over 6700. Judging from the iostat average util of
>> ~40% on the xlog partition, I believe that this number is limited by other
>> factors than the SSD, like CPU, core count, core MHz, memory size/speed, 8.4
>> pgbench without threads. Unfortunately I don't have a faster/more core
>> machines available for testing right now.
>> * pgbench numbers for a larger than RAM database, read only was over 25000
>> tps (details are at the end of this post), during which iostat reported
>> ~18500 read iops and 100% utilization.
>> * pgbench max reported latencies are 20% of comparable BBWC setups.
>> * how reliable it is over time, and how it performs over time I cannot say,
>> since I tested it only for a week.
>
> Thank you very much for posting this analysis.  This has IMNSHO the
> potential to be a game changer.  There are still some unanswered
> questions in terms of how the drive wears, reliability, errors, and
> lifespan but 6700 tps off of a single 400$ device with decent fault
> tolerance is amazing (Intel, consider yourself upstaged).  Ever since
> the first samsung SSD hit the market I've felt the days of the
> spinning disk have been numbered.  Being able to build a 100k tps
> server on relatively inexpensive hardware without an entire rack full
> of drives is starting to look within reach.

Intel's next gen 'enterprise' SSDs are due out later this year.  I have heard from those with access to test
samples that they really like them -- these people rejected the previous versions because of the data loss on power
failure.

So, hopefully there will be some interesting competition later this year in the medium price range enterprise SSD
market.

>
>> Postgres settings:
>> 8.4.4
>> --with-blocksize=4
>> I saw about 10% increase in performance compared to 8KB blocksizes.
>
> That's very interesting -- we need more testing in that department...
>
> regards (and thanks again)
> merlin


Re: Testing Sandforce SSD

From
Scott Carey
Date:
On Aug 3, 2010, at 9:27 AM, Merlin Moncure wrote:
>
> 2) I've heard that some SSD have utilities that you can use to query
> the write cycles in order to estimate lifespan.  Does this one, and is
> it possible to publish the output (an approximation of the amount of
> work along with this would be wonderful)?
>

On the Intel drives, it's available via SMART.  Plenty of hits on how to read the data on Google.  Sandforce drives
probably have it exposed via SMART as well.

I have had over 50 X25-M's (80GB G1's) in production for 22 months that write ~100GB a day, and SMART reports they have
78% of their write cycles left.  Plus, when it dies from usage it supposedly enters a read-only state.  (These drives
only hold recoverable data, so data loss on power failure is not a concern for me.)

So if Sandforce has low write amplification like Intel (they claim to be better) longevity should be fine.
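For anyone who wants to check their own drive: dumping the vendor SMART
attributes is the usual starting point, something like the sketch below
(attribute names and numbers vary per vendor and firmware, so treat it
as illustrative):

 smartctl -A /dev/sdc
 # on Intel drives look for the media wearout / host writes attributes;
 # SandForce-based drives expose similar lifetime counters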

> merlin


Re: Testing Sandforce SSD

From
Chris Browne
Date:
greg@2ndquadrant.com (Greg Smith) writes:
> Yeb Havinga wrote:
>> * What filesystem to use on the SSD? To minimize writes and maximize
>> chance for seeing errors I'd choose ext2 here.
>
> I don't consider there to be any reason to deploy any part of a
> PostgreSQL database on ext2.  The potential for downtime if the fsck
> doesn't happen automatically far outweighs the minimal performance
> advantage you'll actually see in real applications.

Ah, but if the goal is to try to torture the SSD as cruelly as possible,
these aren't necessarily downsides (important or otherwise).

I don't think ext2 helps much in "maximizing chances of seeing errors"
in notably useful ways, as the extra "torture" that takes place as part
of the post-remount fsck isn't notably PG-relevant.  (It's not obvious
that errors encountered would be readily mapped to issues relating to
PostgreSQL.)

I think the WAL-oriented test would be *way* more useful; inducing work
whose "brokenness" can be measured in one series of files in one
directory should be way easier than trying to find changes across a
whole PG cluster.  I don't expect the filesystem choice to be terribly
significant to that.
--
"cbbrowne","@","gmail.com"
"Heuristics (from the  French heure, "hour") limit the  amount of time
spent executing something.  [When using heuristics] it shouldn't take
longer than an hour to do something."

Re: Testing Sandforce SSD

From
Chris Browne
Date:
jd@commandprompt.com ("Joshua D. Drake") writes:
> On Sat, 2010-07-24 at 16:21 -0400, Greg Smith wrote:
>> Greg Smith wrote:
>> > Note that not all of the Sandforce drives include a capacitor; I hope
>> > you got one that does!  I wasn't aware any of the SF drives with a
>> > capacitor on them were even shipping yet, all of the ones I'd seen
>> > were the chipset that doesn't include one still.  Haven't checked in a
>> > few weeks though.
>>
>> Answer my own question here:  the drive Yeb got was the brand spanking
>> new OCZ Vertex 2 Pro, selling for $649 at Newegg for example:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227535 and with
>> the supercacitor listed right in the main production specifications
>> there.  This is officially the first inexpensive (relatively) SSD with a
>> battery-backed write cache built into it.  If Yeb's test results prove
>> it works as it's supposed to under PostgreSQL, I'll be happy to finally
>> have a moderately priced SSD I can recommend to people for database
>> use.  And I fear I'll be out of excuses to avoid buying one as a toy for
>> my home system.
>
> That is quite the toy. I can get 4 SATA-II with RAID Controller, with
> battery backed cache, for the same price or less :P

Sure, but it:
- Fits into a single slot
- Is quiet
- Consumes little power
- Generates little heat
- Is likely to be about as quick as the 4-drive array

It doesn't have the extra 4TB of storage, but if you're building big-ish
databases, metrics have to change anyways.

This is a pretty slick answer for the small OLTP server.
--
output = reverse("moc.liamg" "@" "enworbbc")
http://linuxfinances.info/info/postgresql.html
Chaotic Evil means never having to say you're sorry.

Re: Testing Sandforce SSD

From
Brad Nicholson
Date:
On 10-08-04 03:49 PM, Scott Carey wrote:
> On Aug 2, 2010, at 7:26 AM, Merlin Moncure wrote:
>
>> On Fri, Jul 30, 2010 at 11:01 AM, Yeb Havinga<yebhavinga@gmail.com>  wrote:
>>> After a week testing I think I can answer the question above: does it work
>>> like it's supposed to under PostgreSQL?
>>>
>>> YES
>>>
>>> The drive I have tested is the $435,- 50GB OCZ Vertex 2 Pro,
>>> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227534
>>>
>>> * it is safe to mount filesystems with barrier off, since it has a 'supercap
>>> backed cache'. That data is not lost is confirmed by a dozen power switch
>>> off tests while running either diskchecker.pl or pgbench.
>>> * the above implies its also safe to use this SSD with barriers, though that
>>> will perform less, since this drive obeys write trough commands.
>>> * the highest pgbench tps number for the TPC-B test for a scale 300 database
>>> (~5GB) I could get was over 6700. Judging from the iostat average util of
>>> ~40% on the xlog partition, I believe that this number is limited by other
>>> factors than the SSD, like CPU, core count, core MHz, memory size/speed, 8.4
>>> pgbench without threads. Unfortunately I don't have a faster/more core
>>> machines available for testing right now.
>>> * pgbench numbers for a larger than RAM database, read only was over 25000
>>> tps (details are at the end of this post), during which iostat reported
>>> ~18500 read iops and 100% utilization.
>>> * pgbench max reported latencies are 20% of comparable BBWC setups.
>>> * how reliable it is over time, and how it performs over time I cannot say,
>>> since I tested it only for a week.
>> Thank you very much for posting this analysis.  This has IMNSHO the
>> potential to be a game changer.  There are still some unanswered
>> questions in terms of how the drive wears, reliability, errors, and
>> lifespan but 6700 tps off of a single 400$ device with decent fault
>> tolerance is amazing (Intel, consider yourself upstaged).  Ever since
>> the first samsung SSD hit the market I've felt the days of the
>> spinning disk have been numbered.  Being able to build a 100k tps
>> server on relatively inexpensive hardware without an entire rack full
>> of drives is starting to look within reach.
> Intel's next gen 'enterprise' SSDs are due out later this year.  I have heard from those with access to test
> samples that they really like them -- these people rejected the previous versions because of the data loss on power
> failure.
>
> So, hopefully there will be some interesting competition later this year in the medium price range enterprise SSD
> market.
>

I'll be doing some testing on enterprise-grade SSDs this year.  I'll
also be looking at some hybrid storage products that use SSDs as
accelerators mixed with lower cost storage.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



Re: Testing Sandforce SSD

From
Bruce Momjian
Date:
Greg Smith wrote:
> > * How to test for power failure?
>
> I've had good results using one of the early programs used to
> investigate this class of problems:
> http://brad.livejournal.com/2116715.html?page=2

FYI, this tool is mentioned in the Postgres documentation:

    http://www.postgresql.org/docs/9.0/static/wal-reliability.html

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +