Thread: Raid 10 chunksize

Raid 10 chunksize

From
Mark Kirkwood
Date:
I'm trying to pin down some performance issues with a machine where I
work: we are seeing (read-only) query response times blow out by an
order of magnitude or more at busy times. Initially we blamed
autovacuum, but after a tweak of the cost_delay it is *not* the
problem. Then I looked at checkpoints... and although there was some
correlation between them and the query response times, I'm now thinking
that the raid chunksize may well be the issue.

Fortunately there is an identical DR box, so I could do a little
testing. Details follow:

Sun 4140 2x quad-core opteron 2356 16G RAM,  6x 15K 140G SAS
Debian Lenny
Pg 8.3.6

The disk is laid out using software (md) raid:

4 drives raid 10 *4K* chunksize with database files (ext3 ordered, noatime)
2 drives raid 1 with database transaction logs (ext3 ordered, noatime)
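
(For reference, a layout like this would typically be built with mdadm along
the following lines -- device names are illustrative, and the 4K chunk is the
value under discussion rather than a recommendation:)

mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=4 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1      # database files
mdadm --create /dev/md1 --level=1 --raid-devices=2 \
      /dev/sde1 /dev/sdf1                          # transaction logs (pg_xlog)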

The relevant non default .conf params are:

shared_buffers = 2048MB
work_mem = 4MB
maintenance_work_mem = 1024MB
max_fsm_pages = 153600
bgwriter_lru_maxpages = 200
wal_buffers = 2MB
checkpoint_segments = 32
effective_cache_size = 4096MB
autovacuum_vacuum_scale_factor = 0.1
autovacuum_vacuum_cost_delay = 60    # This is high, but seemed to help...

I've run pgbench:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 655.335102 (including connections establishing)
tps = 655.423232 (excluding connections establishing)
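
(That run corresponds to an invocation roughly like the following, against a
database initialised at scale factor 100 -- database name illustrative:)

pgbench -i -s 100 pgbench          # one-off initialisation at scale factor 100
pgbench -c 24 -t 12000 pgbench     # 24 clients, 12000 transactions each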


Looking at iostat while it is running shows (note sda-sdd raid10, sde
and sdf raid 1):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    56.80    0.00  579.00     0.00     2.47     8.74   133.76  235.10   1.73 100.00
sdb               0.00    45.60    0.00  583.60     0.00     2.45     8.59    52.65   90.03   1.71 100.00
sdc               0.00    49.00    0.00  579.80     0.00     2.45     8.66    72.56  125.09   1.72 100.00
sdd               0.00    58.40    0.00  565.00     0.00     2.42     8.79   135.31  235.52   1.77 100.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    12.80    0.00   23.40     0.00     0.15    12.85     3.04  103.38   4.27  10.00
sdb               0.00    12.80    0.00   22.80     0.00     0.14    12.77     2.31   73.51   3.58   8.16
sdc               0.00    12.80    0.00   21.40     0.00     0.13    12.86     2.38   79.21   3.63   7.76
sdd               0.00    12.80    0.00   21.80     0.00     0.14    12.70     2.66   90.02   3.93   8.56
sde               0.00  2546.80    0.00  146.80     0.00    10.53   146.94     0.97    6.38   5.34  78.40
sdf               0.00  2546.80    0.00  146.60     0.00    10.53   147.05     0.97    6.38   5.53  81.04

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   231.40    0.00  566.80     0.00     3.16    11.41   124.92  228.26   1.76  99.52
sdb               0.00   223.00    0.00  558.00     0.00     3.06    11.23    46.64   83.55   1.70  94.88
sdc               0.00   230.60    0.00  551.60     0.00     3.07    11.40    94.38  171.54   1.76  96.96
sdd               0.00   231.40    0.00  528.60     0.00     2.94    11.37   122.55  220.81   1.83  96.48
sde               0.00  1495.80    0.00   99.00     0.00     6.23   128.86     0.81    8.15   7.76  76.80
sdf               0.00  1495.80    0.00   99.20     0.00     6.26   129.24     0.73    7.40   7.10  70.48

Top looks like:

Cpu(s):  2.5%us,  1.9%sy,  0.0%ni, 71.9%id, 23.4%wa,  0.2%hi,  0.2%si,
0.0%st
Mem:  16474084k total, 15750384k used,   723700k free,  1654320k buffers
Swap:  2104440k total,      944k used,  2103496k free, 13552720k cached

It looks to me like we are maxing out the raid 10 array, and I suspect
the chunksize (4K) is the culprit. However as this is a pest to change
(!) I'd like some opinions on whether I'm jumping to conclusions. I'd
also appreciate comments about what chunksize to use (I've tended to use
256K in the past, but what are folks preferring these days?)

regards

Mark



Re: Raid 10 chunksize

From
Scott Carey
Date:
On 3/24/09 6:09 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> I'm trying to pin down some performance issues with a machine where I
> work, we are seeing (read only) query response times blow out by an
> order of magnitude or more at busy times. Initially we blamed
> autovacuum, but after a  tweak of the cost_delay it is *not* the
> problem. Then I looked at checkpoints... and altho there was some
> correlation with them and the query response - I'm thinking that the
> raid chunksize may well be the issue.
>
> Fortunately there is an identical DR box, so I could do a little
> testing. Details follow:
>
> Sun 4140 2x quad-core opteron 2356 16G RAM,  6x 15K 140G SAS
> Debian Lenny
> Pg 8.3.6
>
> The disk is laid out using software (md) raid:
>
> 4 drives raid 10 *4K* chunksize with database files (ext3 ordered, noatime)
> 2 drives raid 1 with database transaction logs (ext3 ordered, noatime)
>

>
> Top looks like:
>
> Cpu(s):  2.5%us,  1.9%sy,  0.0%ni, 71.9%id, 23.4%wa,  0.2%hi,  0.2%si,
> 0.0%st
> Mem:  16474084k total, 15750384k used,   723700k free,  1654320k buffers
> Swap:  2104440k total,      944k used,  2103496k free, 13552720k cached
>
> It looks to me like we are maxing out the raid 10 array, and I suspect
> the chunksize (4K) is the culprit. However as this is a pest to change
> (!) I'd like some opinions on whether I'm jumping to conclusions. I'd
> also appreciate comments about what chunksize to use (I've tended to use
> 256K in the past, but what are folks preferring these days?)
>
> regards
>
> Mark
>
>

md tends to work great at 1MB chunk sizes with RAID 1 or 10 for whatever
reason.  Unlike a hardware raid card, smaller chunks aren't going to help
random i/o as it won't read the whole 1MB or bother caching much.  Make sure
any partitions built on top of md are 1MB aligned if you go that route.
Random I/O on files smaller than 1MB would be affected -- but that's not a
problem on a 16GB RAM server running a database that won't fit in RAM.

Your xlogs are occasionally close to max usage too -- which is suspicious at
10MB/sec.  There is no reason for them to be on ext3: they are a
transaction log that syncs its writes, so file system journaling doesn't buy
you anything.  Ext2 there will lower the sync times and reduce i/o utilization.
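
Something like the following, assuming the xlog mirror is /dev/md1 (device and
mount point are illustrative):

mkfs.ext2 /dev/md1
mount -t ext2 -o noatime /dev/md1 /var/lib/postgresql/8.3/main/pg_xlog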

I also tend to use xfs if sequential access is important at all (obviously
not so in pgbench).  ext3 is slightly safer in a power failure with unsynced
data, but Postgres has that covered with its own journal anyway, so those
differences are irrelevant.


Re: Raid 10 chunksize

From
David Rees
Date:
On Tue, Mar 24, 2009 at 6:48 PM, Scott Carey <scott@richrelevance.com> wrote:
> Your xlogs are occasionally close to max usage too -- which is suspicious at
> 10MB/sec.  There is no reason for them to be on ext3 since they are a
> transaction log that syncs writes so file system journaling doesn't mean
> anything.  Ext2 there will lower the sync times and reduced i/o utilization.

I would tend to recommend ext3 in data=writeback and make sure that
it's mounted with noatime over using ext2 - for the sole reason that
if the system shuts down unexpectedly, you don't have to worry about a
long fsck when bringing it back up.
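
For example, an fstab entry along these lines (device and mount point
illustrative):

/dev/md1  /var/lib/postgresql/pg_xlog  ext3  noatime,data=writeback  0  2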

Performance between the two filesystems should really be negligible
for Postgres logging.

-Dave

Re: Raid 10 chunksize

From
Scott Marlowe
Date:
On Tue, Mar 24, 2009 at 7:09 PM, Mark Kirkwood <markir@paradise.net.nz> wrote:
> I'm trying to pin down some performance issues with a machine where I work,
> we are seeing (read only) query response times blow out by an order of
> magnitude or more at busy times. Initially we blamed autovacuum, but after a
>  tweak of the cost_delay it is *not* the problem. Then I looked at
> checkpoints... and altho there was some correlation with them and the query
> response - I'm thinking that the raid chunksize may well be the issue.

Sounds to me like you're mostly just running out of bandwidth on your
RAID array.  Whether or not you can tune it to run faster is the real
issue.  This problem becomes worse as you add clients and the RAID
array starts to thrash.  Thrashing is likely to be worse with a small
chunk size, so that's definitely worth looking at fixing.

> Fortunately there is an identical DR box, so I could do a little testing.

Can you try changing the chunksize on the test box you're testing on
to see if that helps?

Re: Raid 10 chunksize

From
Greg Smith
Date:
On Tue, 24 Mar 2009, David Rees wrote:

> I would tend to recommend ext3 in data=writeback and make sure that
> it's mounted with noatime over using ext2 - for the sole reason that
> if the system shuts down unexpectedly, you don't have to worry about a
> long fsck when bringing it back up.

Well, Mark's system is already using noatime, and if you believe

http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/

there's little difference between writeback and ordered on the WAL disk.
Might squeeze out some improvements with ext2 though, and if there's
nothing besides the WAL on there fsck isn't ever going to take very long
anyway--not much of a directory tree to traverse there.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Scott Marlowe wrote:
> On Tue, Mar 24, 2009 at 7:09 PM, Mark Kirkwood <markir@paradise.net.nz> wrote:
>
>> I'm trying to pin down some performance issues with a machine where I work,
>> we are seeing (read only) query response times blow out by an order of
>> magnitude or more at busy times. Initially we blamed autovacuum, but after a
>>  tweak of the cost_delay it is *not* the problem. Then I looked at
>> checkpoints... and altho there was some correlation with them and the query
>> response - I'm thinking that the raid chunksize may well be the issue.
>>
>
> Sounds to me like you're mostly just running out of bandwidth on your
> RAID array.  Whether or not you can tune it to run faster is the real
> issue.  This problem becomes worse as you add clients and the RAID
> array starts to thrash.  Thrashing is likely to be worse with a small
> chunk size, so that's definitely worth a look at fixing.
>
>

Yeah, I was wondering if we are maxing out the bandwidth...
>> Fortunately there is an identical DR box, so I could do a little testing.
>>
>
> Can you try changing the chunksize on the test box you're testing on
> to see if that helps?
>
>

Yes - or I am hoping to anyway (part of posting here was to collect some
outside validation for the idea). Thanks for your input!


Cheers

Mark

Re: Raid 10 chunksize

From
Greg Smith
Date:
On Wed, 25 Mar 2009, Mark Kirkwood wrote:

> I'm thinking that the raid chunksize may well be the issue.

Why?  I'm not saying you're wrong, I just don't see why that parameter
jumped out as a likely cause here.

> Sun 4140 2x quad-core opteron 2356 16G RAM,  6x 15K 140G SAS

That server doesn't have any sort of write cache on it, right?  That means
that all the fsync's done near checkpoint time are going to thrash your
disks around.  One thing you can do to improve that situation is push
checkpoint_segments up to the maximum you can possibly stand.  You could
consider double or even quadruple what you're using right now, the
recovery time after a crash will spike upwards a bunch though.  That will
minimize the number of checkpoints and reduce the average disk I/O they
produce per unit of time, due to how they're spread out in 8.3.  You might
bump upwards checkpoint_completion_target to 0.9 in order to get some
improvement without increasing recovery time as badly.
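
For example, something like this in postgresql.conf (illustrative values in
line with the above, not a tested recommendation for your box):

checkpoint_segments = 64              # or 128; the trade-off is longer crash recovery
checkpoint_completion_target = 0.9    # spread each checkpoint's writes out further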

Also, if you want to minimize total I/O, you might drop
bgwriter_lru_maxpages to 0.  That feature presumes you have some spare I/O
capacity that you can use to prioritize lower latency, and it sounds like you
don't.  You get the lowest total I/O per transaction with the background
writer turned off.

You happened to catch me on a night where I was running some pgbench tests
here, so I can give you something similar to compare against.  Quad-core
system, 8GB of RAM, write-caching controller with 3-disk RAID0 for
database and 1 disk for WAL; Linux software RAID though.  Here's the same
data you collected at the same scale you're testing, with similar
postgresql.conf settings too (same shared_buffers and
checkpoint_segments, I didn't touch any of the vacuum parameters):

number of clients: 32
number of transactions per client: 6250
number of transactions actually processed: 200000/200000
tps = 1097.933319 (including connections establishing)
tps = 1098.372510 (excluding connections establishing)

Cpu(s):  3.6%us,  1.0%sy,  0.0%ni, 57.2%id, 37.5%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:   8174288k total,  5545396k used,  2628892k free,   473248k buffers
Swap:        0k total,        0k used,        0k free,  4050736k cached

sda,b,d are the database, sdc is the WAL, here's a couple of busy periods:

Device:         rrqm/s   wrqm/s   r/s    w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   337.26  0.00 380.72     0.00     2.83    15.24   104.98  278.77   2.46  93.55
sdb               0.00   343.56  0.00 386.31     0.00     2.86    15.17    91.32  236.61   2.46  94.95
sdd               0.00   342.86  0.00 391.71     0.00     2.92    15.28   128.36  342.42   2.43  95.14
sdc               0.00   808.89  0.00  45.45     0.00     3.35   150.72     1.22   26.75  21.13  96.02

Device:         rrqm/s   wrqm/s   r/s   w/s     rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00   377.82  0.00 423.38     0.00     3.13    15.12    74.24  175.21   1.41  59.58
sdb               0.00   371.73  0.00 423.18     0.00     3.13    15.15    50.61  119.81   1.41  59.58
sdd               0.00   372.93  0.00 414.99     0.00     3.06    15.09    60.02  144.32   1.44  59.70
sdc               0.00  3242.16  0.00 258.84     0.00    13.68   108.23     0.88    3.42   2.96  76.60

They don't really look much different from yours.  I'm using software RAID
and haven't touched any of its parameters; didn't even use noatime on the
ext3 filesystems (you should though--that's one of those things the write
cache really helps out with in my case).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Jerry Champlin
Date:
It sounds to me like you need to tune everything you can related to
postgresql, but that is unlikely to be enough as your load continues to
increase.  You might want to look into moving some of the read activity
off of the database.  Depending on your application, memcached or ehcache
could help.  You could also look at using something like Tokyo Cabinet
as a short term front end data store.  Without understanding the
application architecture, I can't offer much in the way of a specific
suggestion.

-Jerry

Jerry Champlin
Absolute Performance Inc.


Mark Kirkwood wrote:
> Scott Marlowe wrote:
>> On Tue, Mar 24, 2009 at 7:09 PM, Mark Kirkwood
>> <markir@paradise.net.nz> wrote:
>>
>>> I'm trying to pin down some performance issues with a machine where
>>> I work,
>>> we are seeing (read only) query response times blow out by an order of
>>> magnitude or more at busy times. Initially we blamed autovacuum, but
>>> after a
>>>  tweak of the cost_delay it is *not* the problem. Then I looked at
>>> checkpoints... and altho there was some correlation with them and
>>> the query
>>> response - I'm thinking that the raid chunksize may well be the issue.
>>>
>>
>> Sounds to me like you're mostly just running out of bandwidth on your
>> RAID array.  Whether or not you can tune it to run faster is the real
>> issue.  This problem becomes worse as you add clients and the RAID
>> array starts to thrash.  Thrashing is likely to be worse with a small
>> chunk size, so that's definitely worth a look at fixing.
>>
>>
>
> Yeah, I was wondering if we are maxing out the bandwidth...
>>> Fortunately there is an identical DR box, so I could do a little
>>> testing.
>>>
>>
>> Can you try changing the chunksize on the test box you're testing on
>> to see if that helps?
>>
>>
>
> Yes - or I am hoping to anyway (part of posting here was to collect
> some outside validation for the idea). Thanks for your input!
>
>
> Cheers
>
> Mark
>

Re: Raid 10 chunksize

From
Scott Carey
Date:
On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote:

> On Wed, 25 Mar 2009, Mark Kirkwood wrote:
>
>> I'm thinking that the raid chunksize may well be the issue.
>
> Why?  I'm not saying you're wrong, I just don't see why that parameter
> jumped out as a likely cause here.
>

If postgres is doing random reads or writes at an 8k block size, and the raid
array is set up with a 4k chunk size, then every 8k random i/o will create TWO
disk seeks since it gets split across two disks.   Effectively, iops will be
cut in half.
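
A quick way to confirm the mismatch on a given box (md device name and data
directory path are assumed):

mdadm --detail /dev/md0 | grep -i 'chunk size'     # e.g. "Chunk Size : 4K"
pg_controldata /var/lib/postgresql/8.3/main | grep 'block size'
# expect "Database block size: 8192", i.e. each 8k random i/o spans two 4k chunks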


Re: Raid 10 chunksize

From
Stef Telford
Date:
Mark Kirkwood wrote:
> I'm trying to pin down some performance issues with a machine where
>  I work, we are seeing (read only) query response times blow out by
>  an order of magnitude or more at busy times. Initially we blamed
> autovacuum, but after a  tweak of the cost_delay it is *not* the
> problem. Then I looked at checkpoints... and altho there was some
> correlation with them and the query response - I'm thinking that
> the raid chunksize may well be the issue.
>
> Fortunately there is an identical DR box, so I could do a little
> testing. Details follow:
>
> Sun 4140 2x quad-core opteron 2356 16G RAM,  6x 15K 140G SAS
> Debian Lenny
> Pg 8.3.6
>
> The disk is laid out using software (md) raid:
>
> 4 drives raid 10 *4K* chunksize with database files (ext3 ordered, noatime)
> 2 drives raid 1 with database transaction logs (ext3 ordered, noatime)
>
> [non-default .conf parameters, pgbench results, iostat and top output
> trimmed -- quoted in full in the original post above]
>
> It looks to me like we are maxing out the raid 10 array, and I
> suspect the chunksize (4K) is the culprit. However as this is a
> pest to change (!) I'd like some opinions on whether I'm jumping to
>  conclusions. I'd also appreciate comments about what chunksize to
> use (I've tended to use 256K in the past, but what are folks
> preferring these days?)
>
> regards
>
> Mark
>
>
>
Hello Mark,
    Okay, so, take all of this with a pinch of salt, but I have the
same config (pretty much) as you, with checkpoint_segments raised to
192. The 'test' database server is a Q8300, 8GB ram, 2 x 7200rpm SATA
on the motherboard controller, which I then striped together with lvm;
lvcreate -n data_lv -i 2 -I 64 mylv -L 60G (expandable under lvm2).
That gives me a stripe size of 64K. Running pgbench with the same
scaling factors:

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1398.907206 (including connections establishing)
tps = 1399.233785 (excluding connections establishing)

    It's also running ext4dev, but, this is the 'playground' server,
not the real iron (And I dread to do that on the real iron). In short,
I think that chunksize/stripesize is killing you. Personally, I would
go for 64 or 128 .. that's just my 2c .. feel free to
ignore/scorn/laugh as applicable ;)

    Regards
    Stef


Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Greg Smith wrote:
> On Wed, 25 Mar 2009, Mark Kirkwood wrote:
>
>> I'm thinking that the raid chunksize may well be the issue.
>
> Why?  I'm not saying you're wrong, I just don't see why that parameter
> jumped out as a likely cause here.
>

See my other post - however, I agree: it wasn't clear whether split
writes (from the small chunksize) were killing us or the array was
simply maxed out...

>> Sun 4140 2x quad-core opteron 2356 16G RAM,  6x 15K 140G SAS
>
> That server doesn't have any sort of write cache on it, right?  That
> means that all the fsync's done near checkpoint time are going to
> thrash your disks around.  One thing you can do to improve that
> situation is push checkpoint_segments up to the maximum you can
> possibly stand.  You could consider double or even quadruple what
> you're using right now, the recovery time after a crash will spike
> upwards a bunch though.  That will minimize the number of checkpoints
> and reduce the average disk I/O they produce per unit of time, due to
> how they're spread out in 8.3.  You might bump upwards
> checkpoint_completion_target to 0.9 in order to get some improvement
> without increasing recovery time as badly.
>

Yeah, no write cache at all.

> Also, if you want to minimize total I/O, you might drop
> bgwriter_lru_maxpages to 0.  That feature presumes you have some spare
> I/O capacity you use to prioritize lower latency, and it sounds like
> you don't.  You get the lowest total I/O per transaction with the
> background writer turned off.
>

Right - but then there's a big, very noticeable stall when you do have to
checkpoint? We want to avoid that, I think, even at the cost of a little
overall throughput.

> You happened to catch me on a night where I was running some pgbench
> tests here, so I can give you something similar to compare against.
> Quad-core system, 8GB of RAM, write-caching controller with 3-disk
> RAID0 for database and 1 disk for WAL; Linux software RAID though.
> Here's the same data you collected at the same scale you're testing,
> with similar postgresql.conf settings too (same shared_buffers and
> checkpoint_segments, I didn't touch any of the vacuum parameters):
>
> number of clients: 32
> number of transactions per client: 6250
> number of transactions actually processed: 200000/200000
> tps = 1097.933319 (including connections establishing)
> tps = 1098.372510 (excluding connections establishing)
>
> Cpu(s):  3.6%us,  1.0%sy,  0.0%ni, 57.2%id, 37.5%wa,  0.0%hi,
> 0.7%si,  0.0%st
> Mem:   8174288k total,  5545396k used,  2628892k free,   473248k buffers
> Swap:        0k total,        0k used,        0k free,  4050736k cached
>
> sda,b,d are the database, sdc is the WAL, here's a couple of busy
> periods:
>
> Device:         rrqm/s   wrqm/s   r/s    w/s    rMB/s    wMB/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00   337.26  0.00 380.72     0.00     2.83
> 15.24   104.98  278.77   2.46  93.55
> sdb               0.00   343.56  0.00 386.31     0.00     2.86
> 15.17    91.32  236.61   2.46  94.95
> sdd               0.00   342.86  0.00 391.71     0.00     2.92
> 15.28   128.36  342.42   2.43  95.14
> sdc               0.00   808.89  0.00  45.45     0.00     3.35
> 150.72     1.22   26.75  21.13  96.02
>
> Device:         rrqm/s   wrqm/s   r/s   w/s     rMB/s    wMB/s
> avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00   377.82  0.00 423.38     0.00     3.13
> 15.12    74.24  175.21   1.41  59.58
> sdb               0.00   371.73  0.00 423.18     0.00     3.13
> 15.15    50.61  119.81   1.41  59.58
> sdd               0.00   372.93  0.00 414.99     0.00     3.06
> 15.09    60.02  144.32   1.44  59.70
> sdc               0.00  3242.16  0.00 258.84     0.00    13.68
> 108.23     0.88    3.42   2.96  76.60
>
> They don't really look much different from yours.  I'm using software
> RAID and haven't touched any of its parameters; didn't even use
> noatime on the ext3 filesystems (you should though--that's one of
> those things the write cache really helps out with in my case).
>
Yeah - with 64K chunksize I'm seeing a result more congruent with yours
(866 or so for 24 clients). I think another pair of disks, so we could
have 3 effective disks for the database, would help get us to similar
results to yours... however, in the meantime I'm trying to get the best
out of what's there!

Thanks for your help

Mark

Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Stef Telford wrote:
>
> Hello Mark,
>     Okay, so, take all of this with a pinch of salt, but, I have the
> same config (pretty much) as you, with checkpoint_Segments raised to
> 192. The 'test' database server is Q8300, 8GB ram, 2 x 7200rpm SATA
> into motherboard which I then lvm stripped together; lvcreate -n
> data_lv -i 2 -I 64 mylv -L 60G (expandable under lvm2). That gives me
> a stripe size of 64. Running pgbench with the same scaling factors;
>
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 100
> number of clients: 24
> number of transactions per client: 12000
> number of transactions actually processed: 288000/288000
> tps = 1398.907206 (including connections establishing)
> tps = 1399.233785 (excluding connections establishing)
>
>     It's also running ext4dev, but, this is the 'playground' server,
> not the real iron (And I dread to do that on the real iron). In short,
> I think that chunksize/stripesize is killing you. Personally, I would
> go for 64 or 128 .. that's jst my 2c .. feel free to
> ignore/scorn/laugh as applicable ;)
>
>
Stef - I suspect that your (quite high) tps is because your SATA disks
are not honoring the fsync() request for each commit. SCSI/SAS disks
tend, by default, to flush their cache at fsync - ATA/SATA tend not to.
Some filesystems (e.g. xfs) will try to work around this with write
barrier support, but it depends on the disk firmware.
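
A quick check, for what it's worth (device name illustrative; SAS/SCSI drives
are usually queried with sdparm rather than hdparm):

hdparm -W /dev/sda                            # reports "write-caching = 1 (on)" or "0 (off)"
hdparm -I /dev/sda | grep -i 'write cache'    # '*' in the feature list means it is enabled
# hdparm -W0 /dev/sda would turn the on-disk write cache off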

Thanks for your reply!

Mark

Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
I wrote:
> Scott Marlowe wrote:
>
>>
>> Can you try changing the chunksize on the test box you're testing on
>> to see if that helps?
>>
>>
>
> Yes - or I am hoping to anyway (part of posting here was to collect
> some outside validation for the idea). Thanks for your input!
>

Rebuilt with 64K chunksize:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 866.512162 (including connections establishing)
tps = 866.651320 (excluding connections establishing)


So 64K looks quite a bit better. I'll endeavor to try out 256K next week
too.

Mark

Re: Raid 10 chunksize

From
Greg Smith
Date:
On Thu, 26 Mar 2009, Mark Kirkwood wrote:

>> Also, if you want to minimize total I/O, you might drop
>> bgwriter_lru_maxpages to 0.  That feature presumes you have some spare I/O
>> capacity you use to prioritize lower latency, and it sounds like you don't.
>> You get the lowest total I/O per transaction with the background writer
>> turned off.
>>
>
> Right - but then a big very noticeable stall when you do have to checkpoint?
> We want to avoid that I think, even at the cost of a little overall
> throughput.

There's not really a big difference if you're running with a large value
for checkpoint_segments.  That spreads the checkpoint I/O over a longer
period of time.  The current background writer doesn't aim to reduce
writes at checkpoint time, because that never really worked out like
people expected it to anyway.

It's aimed instead to write out buffers that database backend processes
are going to need fairly soon, so they are less likely to block because
they have to write them out themselves.  That leads to an occasional bit
of wasted I/O, where the buffer written out gets used or dirtied again
before it can be assigned to a backend.  I've got a long paper expanding
on the documentation here you might find useful:
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

> Yeah - with 64K chunksize I'm seeing a result more congruent with yours
> (866 or so for 24 clients)

That's good to hear.  If adjusting that helped so much, you might consider
aligning the filesystem partitions to the chunk size too; the partition
header usually screws that up on Linux.  See these two references for
ideas:  http://www.vmware.com/resources/techresources/608
http://spiralbound.net/2008/06/09/creating-linux-partitions-for-clariion
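
For example, starting a partition at a 1MiB boundary keeps it aligned for any
power-of-two chunk size up to 1MB (device name illustrative, parted syntax):

parted /dev/sda mklabel gpt
parted /dev/sda mkpart primary 1MiB 100%
# (with fdisk, the equivalent is starting the first partition at sector 2048 rather than 63)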

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Scott Carey
Date:
On 3/25/09 9:43 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> Stef Telford wrote:
>>
>> Hello Mark,
>>     Okay, so, take all of this with a pinch of salt, but, I have the
>> same config (pretty much) as you, with checkpoint_Segments raised to
>> 192. The 'test' database server is Q8300, 8GB ram, 2 x 7200rpm SATA
>> into motherboard which I then lvm stripped together; lvcreate -n
>> data_lv -i 2 -I 64 mylv -L 60G (expandable under lvm2). That gives me
>> a stripe size of 64. Running pgbench with the same scaling factors;
>>
>> starting vacuum...end.
>> transaction type: TPC-B (sort of)
>> scaling factor: 100
>> number of clients: 24
>> number of transactions per client: 12000
>> number of transactions actually processed: 288000/288000
>> tps = 1398.907206 (including connections establishing)
>> tps = 1399.233785 (excluding connections establishing)
>>
>>     It's also running ext4dev, but, this is the 'playground' server,
>> not the real iron (And I dread to do that on the real iron). In short,
>> I think that chunksize/stripesize is killing you. Personally, I would
>> go for 64 or 128 .. that's jst my 2c .. feel free to
>> ignore/scorn/laugh as applicable ;)
>>
>>
> Stef - I suspect that your (quite high) tps is because your SATA disks
> are not honoring the fsync() request for each commit. SCSI/SAS disks
> tend to by default flush their cache at fsync - ATA/SATA tend not to.
> Some filesystems (e.g xfs) will try to work around this with write
> barrier support, but it depends on the disk firmware.

This has not been very true for a while now.  SATA disks will flush their
write cache when told, and properly adhere to write barriers.  Of course,
not all file systems send the right write barrier commands and flush
commands to SATA drives (UFS for example, and older versions of ext3).

It may be the other way around: your SAS drives might have the write cache
disabled for no good reason other than to protect against file systems that
don't work right.

>
> Thanks for your reply!
>
> Mark
>


Re: Raid 10 chunksize

From
Scott Carey
Date:
On 3/25/09 9:28 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:

> I wrote:
>> Scott Marlowe wrote:
>>
>>>
>>> Can you try changing the chunksize on the test box you're testing on
>>> to see if that helps?
>>>
>>>
>>
>> Yes - or I am hoping to anyway (part of posting here was to collect
>> some outside validation for the idea). Thanks for your input!
>>
>
> Rebuilt with 64K chunksize:
>
> transaction type: TPC-B (sort of)
> scaling factor: 100
> number of clients: 24
> number of transactions per client: 12000
> number of transactions actually processed: 288000/288000
> tps = 866.512162 (including connections establishing)
> tps = 866.651320 (excluding connections establishing)
>
>
> So 64K looks quite a bit better. I'll endeavor to try out 256K next week
> too.

Just go all the way to 1MB, md _really_ likes 1MB chunk sizes for some
reason.  Benchmarks right and left on google show this to be optimal.  My
tests with md raid 0 over hardware raid 10's ended up with that being
optimal as well.

Greg's notes on aligning partitions to the chunk are key as well.



>
> Mark
>


Re: Raid 10 chunksize

From
Scott Carey
Date:
On 3/26/09 2:44 PM, "Scott Carey" <scott@richrelevance.com> wrote:

>
>
> On 3/25/09 9:43 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
>
>> Stef Telford wrote:
>>>
>>> Hello Mark,
>>>     Okay, so, take all of this with a pinch of salt, but, I have the
>>> same config (pretty much) as you, with checkpoint_Segments raised to
>>> 192. The 'test' database server is Q8300, 8GB ram, 2 x 7200rpm SATA
>>> into motherboard which I then lvm stripped together; lvcreate -n
>>> data_lv -i 2 -I 64 mylv -L 60G (expandable under lvm2). That gives me
>>> a stripe size of 64. Running pgbench with the same scaling factors;
>>>
>>> starting vacuum...end.
>>> transaction type: TPC-B (sort of)
>>> scaling factor: 100
>>> number of clients: 24
>>> number of transactions per client: 12000
>>> number of transactions actually processed: 288000/288000
>>> tps = 1398.907206 (including connections establishing)
>>> tps = 1399.233785 (excluding connections establishing)
>>>
>>>     It's also running ext4dev, but, this is the 'playground' server,
>>> not the real iron (And I dread to do that on the real iron). In short,
>>> I think that chunksize/stripesize is killing you. Personally, I would
>>> go for 64 or 128 .. that's jst my 2c .. feel free to
>>> ignore/scorn/laugh as applicable ;)
>>>
>>>
>> Stef - I suspect that your (quite high) tps is because your SATA disks
>> are not honoring the fsync() request for each commit. SCSI/SAS disks
>> tend to by default flush their cache at fsync - ATA/SATA tend not to.
>> Some filesystems (e.g xfs) will try to work around this with write
>> barrier support, but it depends on the disk firmware.
>
> This has not been very true for a while now.  SATA disks will flush their
> write cache when told, and properly adhere to write barriers.  Of course,
> not all file systems send the right write barrier commands and flush
> commands to SATA drives (UFS for example, and older versions of ext3).
>
> It may be the other way around, your SAS drives might have the write cache
> disabled for no good reason other than to protect against file systems that
> don't work right.
>

A little extra info here >>  md, LVM, and some other tools do not allow the
file system to use write barriers properly.... So those are on the bad list
for data integrity with SAS or SATA write caches without battery back-up.
However, this is NOT an issue on the postgres data partition.  Data fsync
still works fine; it's the file system journal that might have out-of-order
writes.  For xlogs, write barriers are not important, only fsync() not
lying.

As an additional note, ext4 uses checksums per block in the journal, so it
is resistant to out-of-order writes causing trouble.  The test being compared
against here was on ext4, and most likely the speed increase is partly due to that.

>>
>> Thanks for your reply!
>>
>> Mark
>>


Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Scott Carey wrote:
>
> A little extra info here >>  md, LVM, and some other tools do not allow the
> file system to use write barriers properly.... So those are on the bad list
> for data integrity with SAS or SATA write caches without battery back-up.
> However, this is NOT an issue on the postgres data partition.  Data fsync
> still works fine, its the file system journal that might have out-of-order
> writes.  For xlogs, write barriers are not important, only fsync() not
> lying.
>
> As an additional note, ext4 uses checksums per block in the journal, so it
> is resistant to out of order writes causing trouble.  The test compared to
> here was on ext4, and most likely the speed increase is partly due to that.
>
>

[Looks at  Stef's  config - 2x 7200 rpm SATA RAID 0]  I'm still highly
suspicious of such a system being capable of outperforming one with the
same number of (effective) - much faster - disks *plus* a dedicated WAL
disk pair... unless it is being a little loose about fsync! I'm happy to
believe ext4 is better than ext3 - but not that much!

However, its great to have so many different results to compare against!

Cheers

Mark


Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Scott Carey wrote:
> On 3/25/09 9:28 PM, "Mark Kirkwood" <markir@paradise.net.nz> wrote:
>
>
>>
>> Rebuilt with 64K chunksize:
>>
>> transaction type: TPC-B (sort of)
>> scaling factor: 100
>> number of clients: 24
>> number of transactions per client: 12000
>> number of transactions actually processed: 288000/288000
>> tps = 866.512162 (including connections establishing)
>> tps = 866.651320 (excluding connections establishing)
>>
>>
>> So 64K looks quite a bit better. I'll endeavor to try out 256K next week
>> too.
>>
>
> Just go all the way to 1MB, md _really_ likes 1MB chunk sizes for some
> reason.  Benchmarks right and left on google show this to be optimal.  My
> tests with md raid 0 over hardware raid 10's ended up with that being
> optimal as well.
>
> Greg's notes on aligning partitions to the chunk are key as well.
>
>
Rebuilt with 256K chunksize:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 942.852104 (including connections establishing)
tps = 943.019223 (excluding connections establishing)


A noticeable improvement again. I'm not sure that we will have time (or
patience from the system guys that I keep bugging to redo the raid
setup!) to try 1M, but 256K gets us a 40% or so improvement over the
original 4K setup - which is quite nice!

Looking on the net for md raid benchmarks, it is not 100% clear to me
that 1M is the overall best - several I found had tested sizes like 64K,
128K, 512K and 1M and concluded that 1M was best, but without testing
256K! - whereas others had covered ranges <=512K and decided that
256K was the best. I'd be very interested in seeing your data! (Several
years ago I carried out this type of testing - on a different type
of machine, and for a different database vendor - and found that 256K
seemed to give the overall best result.)

The next step is to align the raid 10 partitions, as you and Greg
suggest and see what effect that has!

Thanks again

Mark

Re: Raid 10 chunksize

From
Stef Telford
Date:
Mark Kirkwood wrote:
> Scott Carey wrote:
>>
>> A little extra info here >>  md, LVM, and some other tools do not
>>  allow the file system to use write barriers properly.... So
>> those are on the bad list for data integrity with SAS or SATA
>> write caches without battery back-up. However, this is NOT an
>> issue on the postgres data partition.  Data fsync still works
>> fine, its the file system journal that might have out-of-order
>> writes.  For xlogs, write barriers are not important, only
>> fsync() not lying.
>>
>> As an additional note, ext4 uses checksums per block in the
>> journal, so it is resistant to out of order writes causing
>> trouble.  The test compared to here was on ext4, and most likely
>> the speed increase is partly due to that.
>>
>>
>
> [Looks at  Stef's  config - 2x 7200 rpm SATA RAID 0]  I'm still
> highly suspicious of such a system being capable of outperforming
> one with the same number of (effective) - much faster - disks
> *plus* a dedicated WAL disk pair... unless it is being a little
> loose about fsync! I'm happy to believe ext4 is better than ext3 -
> but not that much!
>
> However, its great to have so many different results to compare
> against!
>
> Cheers
>
> Mark
>
Hello Mark,
    For the record, this is a 'base' debian 5 install (with openVZ but
postgreSQL is running on the base hardware, not inside a container)
and I have -explicitly- enabled sync in the conf. E.g.:


fsync = on                        # turns forced synchronization on or off
synchronous_commit = on           # immediate fsync at commit
#wal_sync_method = fsync          # the default is the first option


    In fact, if I turn -off- sync commit, it gets about 200 -slower-
rather than faster. Curiously, I also have an intel x25-m winging its
way here for testing/benching under postgreSQL (along with a vertex
120gb). I had one of the nice lads on the OCZ forum bench against a
30gb vertex ssd, and if you think -my- TPS was crazy.. you should have
seen his.


postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t
12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 3662.200088 (including connections establishing)
tps = 3664.823769 (excluding connections establishing)


    (Nb; Thread here;
http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 )

    Curiously, I think with SSD's there may have to be an 'off' flag
if you put the xlog onto an ssd. It seems to complain about 'too
frequent checkpoints'.

    I can't wait for -either- of the drives to arrive. I want to see
in -my- system what the speed is like for SSD's. The dataset I have to
work with is fairly small (30-40GB) so, using an 80GB ssd (even a few
raided) is possible for me. Thankfully ;)

    Regards
    Stef
(ps. I should note, running postgreSQL in a prod environment -without-
a nice UPS is never going to happen on my watch, so, turning on
write-cache (to me) seems like a no-brainer really if it makes this
kind of boost possible)


Re: Raid 10 chunksize

From
Greg Smith
Date:
On Wed, 1 Apr 2009, Stef Telford wrote:

> I have -explicitly- enabled sync in the conf...In fact, if I turn -off-
> sync commit, it gets about 200 -slower- rather than faster.

You should take a look at
http://www.postgresql.org/docs/8.3/static/wal-reliability.html

And check the output from "hdparm -I" as suggested there.  If turning off
fsync doesn't improve your performance, there's almost certainly something
wrong with your setup.  As suggested before, your drives probably have
write caching turned on.  PostgreSQL is incapable of knowing that, and
will happily write in an unsafe manner even if the fsync parameter is
turned on.  There's a bunch more information on this topic at
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

Also:  a run to run variation in pgbench results of +/-10% TPS is normal,
so unless you saw a consistent 200 TPS gain during multiple tests, my guess
is that changing fsync is doing nothing for you, rather than, as you
suggest, making things slower.

> Curiously, I think with SSD's there may have to be an 'off' flag
> if you put the xlog onto an ssd. It seems to complain about 'too
> frequent checkpoints'.

You just need to increase checkpoint_segments from the tiny default if you
want to push any reasonable numbers of transactions/second through pgbench
without seeing this warning.  Same thing happens with any high-performance
disk setup, it's not specific to SSDs.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Stef Telford
Date:
Greg Smith wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>
>> I have -explicitly- enabled sync in the conf...In fact, if I turn
>>  -off- sync commit, it gets about 200 -slower- rather than
>> faster.
>
> You should take a look at
> http://www.postgresql.org/docs/8.3/static/wal-reliability.html
>
> And check the output from "hdparm -I" as suggested there.  If
> turning off fsync doesn't improve your performance, there's almost
> certainly something wrong with your setup.  As suggested before,
> your drives probably have write caching turned on.  PostgreSQL is
> incapable of knowing that, and will happily write in an unsafe
> manner even if the fsync parameter is turned on.  There's a bunch
> more information on this topic at
> http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
>
> Also:  a run to run variation in pgbench results of +/-10% TPS is
> normal, so unless you saw a consistent 200 TPS gain during multiple
>  tests my guess is that changing fsync for you is doing nothing,
> rather than you suggestion that it makes things slower.
>
Hello Greg,
    Turning off fsync -does- increase the throughput noticeably;
-however-, turning off synchronous_commit seemed to slow things down
for me. You're right though: when I toggled sync_commit on the
system, there was a small variation, with TPS coming out between 1100
and 1300. I guess I saw the initial run and thought that there was a
'loss' with sync_commit = off.

     I do agree that the benefit is probably from write-caching, but I
think that this is a 'win' as long as you have a UPS or BBU adaptor,
and really, in a prod environment, not having a UPS is .. well. Crazy ?

>> Curiously, I think with SSD's there may have to be an 'off' flag
>> if you put the xlog onto an ssd. It seems to complain about 'too
>> frequent checkpoints'.
>
> You just need to increase checkpoint_segments from the tiny default
>  if you want to push any reasonable numbers of transactions/second
> through pgbench without seeing this warning.  Same thing happens
> with any high-performance disk setup, it's not specific to SSDs.
>
Good to know - I thought it might be atypical behaviour due to the
nature of SSDs.
Regards
Stef



Re: Raid 10 chunksize

From
Scott Marlowe
Date:
On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>     I do agree that the benefit is probably from write-caching, but I
> think that this is a 'win' as long as you have a UPS or BBU adaptor,
> and really, in a prod environment, not having a UPS is .. well. Crazy ?

You do know that UPSes can fail, right?  En masse sometimes even.

Re: Raid 10 chunksize

From
Stef Telford
Date:
Scott Marlowe wrote:
> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>
>>     I do agree that the benefit is probably from write-caching, but I
>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>
>
> You do know that UPSes can fail, right?  En masse sometimes even.
>
Hello Scott,
    Well, the only time the UPS has failed in my memory was during the
great Eastern Seaboard power outage of 2003. Lots of fond memories
running around Toronto with a gas can looking for oil for generator
power. This said though, anything could happen, the co-lo could be taken
out by a meteor and then sync on or off makes no difference.

    Good UPS, a warm PITR standby, offsite backups and regular checks is
"good enough" for me, and really, that's what it all comes down to.
Mitigating risk and factors into an 'acceptable' amount for each person.
However, if you see over a 2x improvement from turning write-cache 'on'
and have everything else in place, well, that seems like a 'no-brainer'
to me, at least ;)

    Regards
    Stef

Re: Raid 10 chunksize

From
Matthew Wakeling
Date:
On Wed, 1 Apr 2009, Scott Marlowe wrote:
> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>>     I do agree that the benefit is probably from write-caching, but I
>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>
> You do know that UPSes can fail, right?  En masse sometimes even.

I just lost all my diary appointments and address book data on my Palm
device, because of a similar attitude. The device stores all its data in
RAM, and never syncs it to permanent storage (like the SD card in the
expansion slot). But that's fine, right, because it has a battery,
therefore it can never fail? Well, it has the failure mode that if it ever
crashes hard, or the battery fails momentarily due to jogging around in a
pocket, then it just wipes all its data and starts from scratch.

Computers crash. Hardware fails. Relying on un-backed-up RAM to keep your
data safe does not work.

Matthew

--
"Programming today is a race between software engineers striving to build
 bigger and better idiot-proof programs, and the Universe trying to produce
 bigger and better idiots. So far, the Universe is winning."  -- Rich Cook

Re: Raid 10 chunksize

From
Scott Marlowe
Date:
On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote:
> Scott Marlowe wrote:
>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>>
>>>     I do agree that the benefit is probably from write-caching, but I
>>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>>
>>
>> You do know that UPSes can fail, right?  En masse sometimes even.
>>
> Hello Scott,
>    Well, the only time the UPS has failed in my memory, was during the
> great Eastern Seaboard power outage of 2003. Lots of fond memories
> running around Toronto with a gas can looking for oil for generator
> power. This said though, anything could happen, the co-lo could be taken
> out by a meteor and then sync on or off makes no difference.

Meteor strike is far less likely than a power surge taking out a UPS.
I saw a whole data center go black when a power conditioner blew out,
taking out the other three power conditioners, both industrial UPSes
and the switch for the diesel generator.  And I have friends who have
seen the same type of thing before as well.  The data is the most
expensive part of any server.

Re: Raid 10 chunksize

From
Matthew Wakeling
Date:
On Wed, 1 Apr 2009, Stef Telford wrote:
>    Good UPS, a warm PITR standby, offsite backups and regular checks is
> "good enough" for me, and really, that's what it all comes down to.
> Mitigating risk and factors into an 'acceptable' amount for each person.
> However, if you see over a 2x improvement from turning write-cache 'on'
> and have everything else in place, well, that seems like a 'no-brainer'
> to me, at least ;)

In that case, buying a battery-backed-up cache in the RAID controller
would be even more of a no-brainer.

Matthew

--
 If pro is the opposite of con, what is the opposite of progress?

Re: Raid 10 chunksize

From
Scott Marlowe
Date:
On Wed, Apr 1, 2009 at 11:01 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>>
>>   Good UPS, a warm PITR standby, offsite backups and regular checks is
>> "good enough" for me, and really, that's what it all comes down to.
>> Mitigating risk and factors into an 'acceptable' amount for each person.
>> However, if you see over a 2x improvement from turning write-cache 'on'
>> and have everything else in place, well, that seems like a 'no-brainer'
>> to me, at least ;)
>
> In that case, buying a battery-backed-up cache in the RAID controller would
> be even more of a no-brainer.

This is especially true in that you can reduce downtime.  A lot of
times downtime costs as much as anything else.

Re: Raid 10 chunksize

From
Stef Telford
Date:
Matthew Wakeling wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>>    Good UPS, a warm PITR standby, offsite backups and regular checks is
>> "good enough" for me, and really, that's what it all comes down to.
>> Mitigating risk and factors into an 'acceptable' amount for each person.
>> However, if you see over a 2x improvement from turning write-cache 'on'
>> and have everything else in place, well, that seems like a 'no-brainer'
>> to me, at least ;)
>
> In that case, buying a battery-backed-up cache in the RAID controller
> would be even more of a no-brainer.
>
> Matthew
>
Hey Matthew,
    See about 3 messages ago.. we already have them (I did say UPS or
BBU; it should have been a logical 'and' instead of a logical 'or' .. my
bad ;). You're right though, that was a no-brainer as well.

    I am wondering how the card (3ware 9550sx) will work with SSD's, md
or lvm, blocksize, ext3 or ext4 .. but.. this is the point of
benchmarking ;)

    Regards
    Stef

Re: Raid 10 chunksize

From
Greg Smith
Date:
On Wed, 1 Apr 2009, Scott Marlowe wrote:

> Meteor strike is far less likely than a power surge taking out a UPS.

I average having a system go down during a power outage because the UPS it
was attached to wasn't working right anymore about once every five years.
And I don't usually manage that many systems.

The only real way to know if a UPS is working right is to actually detach
power and confirm the battery still works, which is downtime nobody ever
feels is warranted for a production system.  Then, one day the power dies,
the UPS battery doesn't work to spec anymore, and you're done.

Of course, I have a battery-backed cache controller in my home desktop, so
that gives you an idea of where I'm at as far as paranoia goes.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Matthew Wakeling
Date:
On Wed, 1 Apr 2009, Greg Smith wrote:
> The only real way to know if a UPS is working right is to actually detach
> power and confirm the battery still works, which is downtime nobody ever
> feels is warranted for a production system.  Then, one day the power dies,
> the UPS battery doesn't work to spec anymore, and you're done.

Most decent servers have dual power supplies, and they should really be
connected to two independent UPS units. You can test them one by one
without much risk of bringing down your server.

Matthew

--
 Okay, I'm weird! But I'm saving up to be eccentric.

Re: Raid 10 chunksize

From
Scott Marlowe
Date:
On Wed, Apr 1, 2009 at 11:54 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Wed, 1 Apr 2009, Greg Smith wrote:
>>
>> The only real way to know if a UPS is working right is to actually detach
>> power and confirm the battery still works, which is downtime nobody ever
>> feels is warranted for a production system.  Then, one day the power dies,
>> the UPS battery doesn't work to spec anymore, and you're done.
>
> Most decent servers have dual power supplies, and they should really be
> connected to two independent UPS units. You can test them one by one without
> much risk of bringing down your server.

Yeah, our primary DB servers have three PSes and can run on any two
just fine.  We have three power busses each coming from a different
UPS at the hosting center.

Re: Raid 10 chunksize

From
Stef Telford
Date:

Stef Telford wrote:
> Mark Kirkwood wrote:
>> Scott Carey wrote:
>>> A little extra info here >>  md, LVM, and some other tools do
>>> not allow the file system to use write barriers properly.... So
>>>  those are on the bad list for data integrity with SAS or SATA
>>> write caches without battery back-up. However, this is NOT an
>>> issue on the postgres data partition.  Data fsync still works
>>> fine, its the file system journal that might have out-of-order
>>> writes.  For xlogs, write barriers are not important, only
>>> fsync() not lying.
>>>
>>> As an additional note, ext4 uses checksums per block in the
>>> journal, so it is resistant to out of order writes causing
>>> trouble.  The test compared to here was on ext4, and most
>>> likely the speed increase is partly due to that.
>>>
>>>
>> [Looks at  Stef's  config - 2x 7200 rpm SATA RAID 0]  I'm still
>> highly suspicious of such a system being capable of outperforming
>>  one with the same number of (effective) - much faster - disks
>> *plus* a dedicated WAL disk pair... unless it is being a little
>> loose about fsync! I'm happy to believe ext4 is better than ext3
>> - but not that much!
>
>> However, its great to have so many different results to compare
>> against!
>
>> Cheers
>
>> Mark
>
> postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24
> -t 12000 test_db starting vacuum...end. transaction type: TPC-B
> (sort of) scaling factor: 100 number of clients: 24 number of
> transactions per client: 12000 number of transactions actually
> processed: 288000/288000 tps = 3662.200088 (including connections
> establishing) tps = 3664.823769 (excluding connections
> establishing)
>
>
> (Nb; Thread here;
> http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 )
Fyi, I got my intel x25-m in the mail, and I have been benching it for
the past hour or so. Here are some of the rough and ready figures.
Note that I don't get anywhere near the vertex benchmark. I did
hotplug it and made the filesystem using Theodore Ts'o's webpage
directions (
http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/
). The only thing is, ext3/4 seems to be fixated on a blocksize of
4k, and I am wondering if this could be part of the 'problem'. Any
ideas/thoughts on tuning gratefully received.

Anyway, benchmarks (same system as previously, etc)

(ext4dev, 4k block size, pg_xlog on 2x7.2krpm raid-0, rest on SSD)

root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1407.254118 (including connections establishing)
tps = 1407.645996 (excluding connections establishing)

(ext4dev, 4k block size, everything on SSD)

root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 2130.734705 (including connections establishing)
tps = 2131.545519 (excluding connections establishing)

(I wanted to try and see if dropping random_page_cost down to 2.0, with
seq_page_cost = 2.0, would make a difference. E.g. making the planner
aware that a random read costs the same as a sequential one.)

root@debian:/var/lib/postgresql/8.3/main#
/usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1982.481185 (including connections establishing)
tps = 1983.223281 (excluding connections establishing)


Regards
Stef


Re: Raid 10 chunksize

From
david@lang.hm
Date:
On Wed, 1 Apr 2009, Mark Kirkwood wrote:

> Scott Carey wrote:
>>
>> A little extra info here >>  md, LVM, and some other tools do not allow the
>> file system to use write barriers properly.... So those are on the bad list
>> for data integrity with SAS or SATA write caches without battery back-up.
>> However, this is NOT an issue on the postgres data partition.  Data fsync
>> still works fine, its the file system journal that might have out-of-order
>> writes.  For xlogs, write barriers are not important, only fsync() not
>> lying.
>>
>> As an additional note, ext4 uses checksums per block in the journal, so it
>> is resistant to out of order writes causing trouble.  The test compared to
>> here was on ext4, and most likely the speed increase is partly due to that.
>>
>>
>
> [Looks at  Stef's  config - 2x 7200 rpm SATA RAID 0]  I'm still highly
> suspicious of such a system being capable of outperforming one with the same
> number of (effective) - much faster - disks *plus* a dedicated WAL disk
> pair... unless it is being a little loose about fsync! I'm happy to believe
> ext4 is better than ext3 - but not that much!

given how _horrible_ ext3 is with fsync, I can believe it more easily with
fsync turned on than with it off.

David Lang

> However, its great to have so many different results to compare against!
>
> Cheers
>
> Mark
>
>
>

Re: Raid 10 chunksize

From
Stef Telford
Date:

Stef Telford wrote:
> Stef Telford wrote:
>> Mark Kirkwood wrote:
>>> Scott Carey wrote:
>>>> A little extra info here >>  md, LVM, and some other tools do
>>>>  not allow the file system to use write barriers properly....
>>>> So those are on the bad list for data integrity with SAS or
>>>> SATA write caches without battery back-up. However, this is
>>>> NOT an issue on the postgres data partition.  Data fsync
>>>> still works fine, its the file system journal that might have
>>>> out-of-order writes.  For xlogs, write barriers are not
>>>> important, only fsync() not lying.
>>>>
>>>> As an additional note, ext4 uses checksums per block in the
>>>> journal, so it is resistant to out of order writes causing
>>>> trouble.  The test compared to here was on ext4, and most
>>>> likely the speed increase is partly due to that.
>>>>
>>>>
>>> [Looks at  Stef's  config - 2x 7200 rpm SATA RAID 0]  I'm still
>>>  highly suspicious of such a system being capable of
>>> outperforming one with the same number of (effective) - much
>>> faster - disks *plus* a dedicated WAL disk pair... unless it is
>>> being a little loose about fsync! I'm happy to believe ext4 is
>>> better than ext3 - but not that much! However, its great to
>>> have so many different results to compare against! Cheers Mark
>> postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24
>>  -t 12000 test_db starting vacuum...end. transaction type: TPC-B
>> (sort of) scaling factor: 100 number of clients: 24 number of
>> transactions per client: 12000 number of transactions actually
>> processed: 288000/288000 tps = 3662.200088 (including connections
>>  establishing) tps = 3664.823769 (excluding connections
>> establishing)
>
>
>> (Nb; Thread here;
>> http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 )
> Fyi, I got my intel x25-m in the mail, and I have been benching it
> for the past hour or so. Here are some of the rough and ready
> figures. Note that I don't get anywhere near the vertex benchmark.
> I did hotplug it and made the filesystem using Theodore Ts'o
> webpage directions (
> http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/
>  ) ; The only thing is, ext3/4 seems to be fixated on a blocksize
> of 4k, I am wondering if this could be part of the 'problem'. Any
> ideas/thoughts on tuning gratefully received.
>
> Anyway, benchmarks (same system as previously, etc)
>
> (ext4dev, 4k block size, pg_xlog on 2x7.2krpm raid-0, rest on SSD)
>
> root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000
> test_db starting vacuum...end. transaction type: TPC-B (sort of)
> scaling factor: 100 number of clients: 24 number of transactions
> per client: 12000 number of transactions actually processed:
> 288000/288000 tps = 1407.254118 (including connections
> establishing) tps = 1407.645996 (excluding connections
> establishing)
>
> (ext4dev, 4k block size, everything on SSD)
>
> root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000
> test_db starting vacuum...end. transaction type: TPC-B (sort of)
> scaling factor: 100 number of clients: 24 number of transactions
> per client: 12000 number of transactions actually processed:
> 288000/288000 tps = 2130.734705 (including connections
> establishing) tps = 2131.545519 (excluding connections
> establishing)
>
> (I wanted to try and see if random_page_cost dropped down to 2.0,
> sequential_page_cost = 2.0 would make a difference. Eg; making the
> planner aware that a random was the same cost as a sequential)
>
> root@debian:/var/lib/postgresql/8.3/main#
> /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db starting
> vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100
>  number of clients: 24 number of transactions per client: 12000
> number of transactions actually processed: 288000/288000 tps =
> 1982.481185 (including connections establishing) tps = 1983.223281
> (excluding connections establishing)
>
>
> Regards Stef

Here is the single x25-m SSD, write cache -disabled-, XFS, noatime
mounted using the no-op scheduler;

stef@debian:~$ sudo /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000
test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1427.781843 (including connections establishing)
tps = 1428.137858 (excluding connections establishing)

Regards
Stef


Re: Raid 10 chunksize

From
david@lang.hm
Date:
On Wed, 1 Apr 2009, david@lang.hm wrote:

> On Wed, 1 Apr 2009, Mark Kirkwood wrote:
>
>> Scott Carey wrote:
>>>
>>> A little extra info here >>  md, LVM, and some other tools do not allow
>>> the
>>> file system to use write barriers properly.... So those are on the bad
>>> list
>>> for data integrity with SAS or SATA write caches without battery back-up.
>>> However, this is NOT an issue on the postgres data partition.  Data fsync
>>> still works fine, its the file system journal that might have out-of-order
>>> writes.  For xlogs, write barriers are not important, only fsync() not
>>> lying.
>>>
>>> As an additional note, ext4 uses checksums per block in the journal, so it
>>> is resistant to out of order writes causing trouble.  The test compared to
>>> here was on ext4, and most likely the speed increase is partly due to
>>> that.
>>>
>>>
>>
>> [Looks at  Stef's  config - 2x 7200 rpm SATA RAID 0]  I'm still highly
>> suspicious of such a system being capable of outperforming one with the
>> same number of (effective) - much faster - disks *plus* a dedicated WAL
>> disk pair... unless it is being a little loose about fsync! I'm happy to
>> believe ext4 is better than ext3 - but not that much!
>
> given how _horrible_ ext3 is with fsync, I can belive it more easily with
> fsync turned on than with it off.

I realized after sending this that I needed to elaborate a little more.

over the last week there has been a _huge_ thread on the linux-kernel list
(>400 messages) that is summarized on lwn.net at
http://lwn.net/SubscriberLink/326471/b7f5fedf0f7c545f/

there is a lot of information in this thread, but one big thing is that in
data=ordered mode (the default for most distros) ext3 can end up having to
write all pending data when you do an fsync on one file. In addition,
reading from disk can take priority over writing the journal entry (the IO
scheduler assumes that there is someone waiting for a read, but not for a
write), so if you have one process trying to do an fsync and another
reading from the disk, the one doing the fsync needs to wait until the
disk is idle before the fsync can complete.

ext4 does things enough differently that fsyncs are relatively cheap again
(like they are on XFS, ext2, and other filesystems). The tradeoff is that
if you _don't_ do an fsync there is an increased window where you will get
data corruption if you crash.
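
For anyone who wants to see this for themselves, here is a minimal timing
loop (my own sketch, not the code from the lwn thread; the file name and
loop count are arbitrary) that reports how many write+fsync cycles per
second a filesystem really sustains:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 1000;
    char buf[8192];
    struct timespec t0, t1;
    double secs;
    int i, fd;

    memset(buf, 'x', sizeof buf);
    fd = open("fsync_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++) {
        /* rewrite the same 8k block and force it out every time */
        if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t) sizeof buf) { perror("pwrite"); return 1; }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d fsyncs in %.2f s = %.1f fsyncs/sec\n",
           iterations, secs, iterations / secs);
    close(fd);
    return 0;
}

Compile with something like "gcc -o fsync_timer fsync_timer.c -lrt", then
run it on an ext3 mount (with and without a concurrent reader) and on an
ext4 mount to see the difference described above.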

David Lang

Re: Raid 10 chunksize

From
Scott Carey
Date:
On 4/1/09 10:01 AM, "Matthew Wakeling" <matthew@flymine.org> wrote:

> On Wed, 1 Apr 2009, Stef Telford wrote:
>>    Good UPS, a warm PITR standby, offsite backups and regular checks is
>> "good enough" for me, and really, that's what it all comes down to.
>> Mitigating risk and factors into an 'acceptable' amount for each person.
>> However, if you see over a 2x improvement from turning write-cache 'on'
>> and have everything else in place, well, that seems like a 'no-brainer'
>> to me, at least ;)
>
> In that case, buying a battery-backed-up cache in the RAID controller
> would be even more of a no-brainer.
>
> Matthew
>

Why? Honestly, SATA write cache is safer than a battery-backed raid card.
The raid card is one more point of failure, and SATA write caches with a
modern file system are safe.




Re: Raid 10 chunksize

From
Scott Carey
Date:
On 4/1/09 9:54 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote:
>> Scott Marlowe wrote:
>>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>>>
>>>>     I do agree that the benefit is probably from write-caching, but I
>>>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>>>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>>>
>>>
>>> You do know that UPSes can fail, right?  En masse sometimes even.
>>>
>> Hello Scott,
>>    Well, the only time the UPS has failed in my memory, was during the
>> great Eastern Seaboard power outage of 2003. Lots of fond memories
>> running around Toronto with a gas can looking for oil for generator
>> power. This said though, anything could happen, the co-lo could be taken
>> out by a meteor and then sync on or off makes no difference.
>
> Meteor strike is far less likely than a power surge taking out a UPS.
> I saw a whole data center go black when a power conditioner blew out,
> taking out the other three power conditioners, both industrial UPSes
> and the switch for the diesel generator.  And I have friends who have
> seen the same type of thing before as well.  The data is the most
> expensive part of any server.
>
Yeah, well I've had a RAID card die, which broke its Battery backed cache.
They're all unsafe, technically.

In fact, not only are battery backed caches unsafe, but hard drives.  They
can return bad data.  So if you want to be really safe:

1: don't use Linux -- you have to use something with full data and metadata
checksums like ZFS or very expensive proprietary file systems.
2: combine it with mirrored SSD's that don't use write cache (so you can
have fsync perf about as good as a battery backed raid card without that
risk).
4: keep a live redundant system with a PITR backup at another site that can
recover in a short period of time.
3: Run in a datacenter well underground with a plutonium nuclear power
supply.  Meteor strikes and Nuclear holocaust, beware!


Re: Raid 10 chunksize

From
Scott Carey
Date:
On 4/1/09 9:15 AM, "Stef Telford" <stef@ummon.com> wrote:

>
> Greg Smith wrote:
>> On Wed, 1 Apr 2009, Stef Telford wrote:
>>
>>> I have -explicitly- enabled sync in the conf...In fact, if I turn
>>>  -off- sync commit, it gets about 200 -slower- rather than
>>> faster.
>>
>> You should take a look at
>> http://www.postgresql.org/docs/8.3/static/wal-reliability.html
>>
>> And check the output from "hdparm -I" as suggested there.  If
>> turning off fsync doesn't improve your performance, there's almost
>> certainly something wrong with your setup.  As suggested before,
>> your drives probably have write caching turned on.  PostgreSQL is
>> incapable of knowing that, and will happily write in an unsafe
>> manner even if the fsync parameter is turned on.  There's a bunch
>> more information on this topic at
>> http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
>>
>> Also:  a run to run variation in pgbench results of +/-10% TPS is
>> normal, so unless you saw a consistent 200 TPS gain during multiple
>>  tests my guess is that changing fsync for you is doing nothing,
>> rather than you suggestion that it makes things slower.
>>
> Hello Greg,
>     Turning off fsync -does- increase the throughput noticeably,
> -however-, turning off synchronous_commit seemed to slow things down
> for me. You're right though, when I toggled the sync_commit on the
> system, there was a small variation with TPS coming out between 1100
> and 1300. I guess I saw the initial run and thought that there was a
> 'loss' in sync_commit = off
>
>      I do agree that the benefit is probably from write-caching, but I
> think that this is a 'win' as long as you have a UPS or BBU adaptor,
> and really, in a prod environment, not having a UPS is .. well. Crazy ?

Write caching on SATA is totally fine.  There were some old ATA drives that
when paired with some file systems or OS's would not be safe.  There are
some combinations that have unsafe write barriers.  But there is a standard
well supported ATA command to sync and only return after the data is on
disk.  If you are running an OS that is anything recent at all, and any
disks that are not really old, you're fine.

The notion that current SATA systems are unsafe with write caching enabled
(or SAS for that matter) is not fully informed.  It is only unsafe if you
pair them with a file system and OS that don't issue the necessary cache
flush commands on sync.




Re: Raid 10 chunksize

From
Scott Carey
Date:
On 4/1/09 1:44 PM, "Stef Telford" <stef@ummon.com> wrote:

>
> Stef Telford wrote:
>> Stef Telford wrote:
>> Fyi, I got my intel x25-m in the mail, and I have been benching it
>> for the past hour or so. Here are some of the rough and ready
>> figures. Note that I don't get anywhere near the vertex benchmark.
>> I did hotplug it and made the filesystem using Theodore Ts'o
>> webpage directions (
>> http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-
>> block-size/
>>  ) ; The only thing is, ext3/4 seems to be fixated on a blocksize
>> of 4k, I am wondering if this could be part of the 'problem'. Any
>> ideas/thoughts on tuning gratefully received.
>>
>> Anyway, benchmarks (same system as previously, etc)
>>
>> (ext4dev, 4k block size, pg_xlog on 2x7.2krpm raid-0, rest on SSD)
>>
>> root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000
>> test_db starting vacuum...end. transaction type: TPC-B (sort of)
>> scaling factor: 100 number of clients: 24 number of transactions
>> per client: 12000 number of transactions actually processed:
>> 288000/288000 tps = 1407.254118 (including connections
>> establishing) tps = 1407.645996 (excluding connections
>> establishing)
>>
>> (ext4dev, 4k block size, everything on SSD)
>>
>> root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000
>> test_db starting vacuum...end. transaction type: TPC-B (sort of)
>> scaling factor: 100 number of clients: 24 number of transactions
>> per client: 12000 number of transactions actually processed:
>> 288000/288000 tps = 2130.734705 (including connections
>> establishing) tps = 2131.545519 (excluding connections
>> establishing)
>>
>> (I wanted to try and see if random_page_cost dropped down to 2.0,
>> sequential_page_cost = 2.0 would make a difference. Eg; making the
>> planner aware that a random was the same cost as a sequential)
>>
>> root@debian:/var/lib/postgresql/8.3/main#
>> /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db starting
>> vacuum...end. transaction type: TPC-B (sort of) scaling factor: 100
>>  number of clients: 24 number of transactions per client: 12000
>> number of transactions actually processed: 288000/288000 tps =
>> 1982.481185 (including connections establishing) tps = 1983.223281
>> (excluding connections establishing)
>>
>>
>> Regards Stef
>
> Here is the single x25-m SSD, write cache -disabled-, XFS, noatime
> mounted using the no-op scheduler;
>
> stef@debian:~$ sudo /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000
> test_db
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 100
> number of clients: 24
> number of transactions per client: 12000
> number of transactions actually processed: 288000/288000
> tps = 1427.781843 (including connections establishing)
> tps = 1428.137858 (excluding connections establishing)


Ok, in my experience the next step to better performance on this setup in
situations not involving pg_bench is to turn dirty_background_ratio down to
a very small number (1 or 2).  However, pg_bench relies quite a bit on the
OS postponing writes due to its quirkiness. Depending on the scaling factor
to memory ratio and how big shared_buffers is, results may vary.

So I'm not going to predict that that will help this particular case, but am
commenting that in general I have gotten the best throughput and lowest
latency with a low dirty_background_ratio and the noop scheduler when using
the Intel SSDs.  I've tried all the other scheduler and queue tunables,
without much result.  Increasing max_sectors_kb helped a bit in some cases,
but it seemed inconsistent.
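
For reference, one way to apply that at runtime (this is just a C stand-in
for the usual "sysctl -w vm.dirty_background_ratio=1"; it needs root and
does not persist across reboots):

#include <stdio.h>

int main(void)
{
    /* lower the background writeback threshold to 1% of RAM, as suggested above */
    FILE *f = fopen("/proc/sys/vm/dirty_background_ratio", "w");

    if (f == NULL) { perror("dirty_background_ratio"); return 1; }
    fprintf(f, "1\n");
    fclose(f);
    return 0;
}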

The Vertex does some things differently that might be very good for postgres
(but bad for some other apps); from what I've seen it prioritizes writes
more.

Furthermore, it has and uses a write cache from what I've read... The Intel
drives don't use a write cache at all (the RAM is for the LBA-to-physical map
and management).  If the vertex is way faster, I would suspect that its
write cache may not be properly honoring cache flush commands.

I have an app where I wish to keep the read latency as low as possible while
doing a large batch write with the write at ~90% disk utilization, and the
Intels destroy everything else at that task so far.

And in all honesty, I trust the Intel's data integrity a lot more than OCZ
for now.

>
> Regards
> Stef


Re: Raid 10 chunksize

From
david@lang.hm
Date:
On Wed, 1 Apr 2009, Scott Carey wrote:

> On 4/1/09 9:54 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>
>> On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote:
>>> Scott Marlowe wrote:
>>>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>>>>
>>>>>     I do agree that the benefit is probably from write-caching, but I
>>>>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>>>>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>>>>
>>>>
>>>> You do know that UPSes can fail, right?  En masse sometimes even.
>>>>
>>> Hello Scott,
>>>    Well, the only time the UPS has failed in my memory, was during the
>>> great Eastern Seaboard power outage of 2003. Lots of fond memories
>>> running around Toronto with a gas can looking for oil for generator
>>> power. This said though, anything could happen, the co-lo could be taken
>>> out by a meteor and then sync on or off makes no difference.
>>
>> Meteor strike is far less likely than a power surge taking out a UPS.
>> I saw a whole data center go black when a power conditioner blew out,
>> taking out the other three power conditioners, both industrial UPSes
>> and the switch for the diesel generator.  And I have friends who have
>> seen the same type of thing before as well.  The data is the most
>> expensive part of any server.
>>
> Yeah, well I've had a RAID card die, which broke its Battery backed cache.
> They're all unsafe, technically.
>
> In fact, not only are battery backed caches unsafe, but hard drives.  They
> can return bad data.  So if you want to be really safe:
>
> 1: don't use Linux -- you have to use something with full data and metadata
> checksums like ZFS or very expensive proprietary file systems.

this will involve other tradeoffs

> 2: combine it with mirrored SSD's that don't use write cache (so you can
> have fsync perf about as good as a battery backed raid card without that
> risk).

they _all_ have write caches. a beast like you are looking for doesn't
exist

> 4: keep a live redundant system with a PITR backup at another site that can
> recover in a short period of time.

a good option to keep in mind (and when the new replication code becomes
available, that will be even better)

> 3: Run in a datacenter well underground with a plutonium nuclear power
> supply.  Meteor strikes and Nuclear holocaust, beware!

at some point all that will fail

but you missed point #5 (in many ways a more important point than the
others that you describe)

switch from using postgres to using a database that can do two-phase
commits across redundant machines so that you know the data is safe on
multiple systems before the command is considered complete.

David Lang

Re: Raid 10 chunksize

From
Scott Marlowe
Date:
On Wed, Apr 1, 2009 at 4:15 PM, Scott Carey <scott@richrelevance.com> wrote:
>
> On 4/1/09 9:54 AM, "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>
>> On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <stef@ummon.com> wrote:
>>> Scott Marlowe wrote:
>>>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <stef@ummon.com> wrote:
>>>>
>>>>>     I do agree that the benefit is probably from write-caching, but I
>>>>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>>>>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>>>>
>>>>
>>>> You do know that UPSes can fail, right?  En masse sometimes even.
>>>>
>>> Hello Scott,
>>>    Well, the only time the UPS has failed in my memory, was during the
>>> great Eastern Seaboard power outage of 2003. Lots of fond memories
>>> running around Toronto with a gas can looking for oil for generator
>>> power. This said though, anything could happen, the co-lo could be taken
>>> out by a meteor and then sync on or off makes no difference.
>>
>> Meteor strike is far less likely than a power surge taking out a UPS.
>> I saw a whole data center go black when a power conditioner blew out,
>> taking out the other three power conditioners, both industrial UPSes
>> and the switch for the diesel generator.  And I have friends who have
>> seen the same type of thing before as well.  The data is the most
>> expensive part of any server.
>>
> Yeah, well I've had a RAID card die, which broke its Battery backed cache.
> They're all unsafe, technically.

That's why you use two controllers with mirror sets across them and
then RAID-0 across the top.  But I know what you mean.  Now the mobo
and memory are the single point of failure.  Next stop, Sequent etc.

> In fact, not only are battery backed caches unsafe, but hard drives.  They
> can return bad data.  So if you want to be really safe:
>
> 1: don't use Linux -- you have to use something with full data and metadata
> checksums like ZFS or very expensive proprietary file systems.

You'd better be running them on Sequent or Sysplex mainframe-type hardware.

> 4: keep a live redundant system with a PITR backup at another site that can
> recover in a short period of time.
> 3: Run in a datacenter well underground with a plutonium nuclear power
> supply.  Meteor strikes and Nuclear holocaust, beware!

Please, such hyperbole!  Everyone knows it can run on uranium just as
well.  I'm sure these guys:
http://royal.pingdom.com/2008/11/14/the-worlds-most-super-designed-data-center-fit-for-a-james-bond-villain/
can sort that out for you.

Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Stef Telford wrote:
>
> Hello Mark,
>     For the record, this is a 'base' debian 5 install (with openVZ but
> postgreSQL is running on the base hardware, not inside a container)
> and I have -explicitly- enabled sync in the conf. Eg;
>
>
> fsync = on                                            # turns forced
>
>
>     Infact, if I turn -off- sync commit, it gets about 200 -slower-
> rather than faster.
>
Sorry Stef - didn't mean to doubt you....merely your disks!

Cheers

Mark

Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Greg Smith wrote:
>
>> Yeah - with 64K chunksize I'm seeing a result more congruent with
>> yours (866 or so for 24 clients)
>
> That's good to hear.  If adjusting that helped so much, you might
> consider aligning the filesystem partitions to the chunk size too; the
> partition header usually screws that up on Linux.  See these two
> references for ideas:
> http://www.vmware.com/resources/techresources/608
> http://spiralbound.net/2008/06/09/creating-linux-partitions-for-clariion
>

Well I went away and did this (actually organized for the system
folks to...). Retesting showed no appreciable difference (if anything
slower). Then I got to thinking:

For a partition created on a (hardware) raided device, sure - alignment
is very important; however, in my case we are using software (md) raid,
which creates devices out of individual partitions (which are on
individual SAS disks), e.g.:

md3 : active raid10 sda4[0] sdd4[3] sdc4[2] sdb4[1]
     177389056 blocks 256K chunks 2 near-copies [4/4] [UUUU]

I'm thinking that alignment issues do not apply here, as md will
allocate chunks starting at the beginning of wherever sda4 (etc) begins
- so the absolute starting position of sda4 is irrelevant. Or am I
missing something?

Thanks again

Mark


Re: Raid 10 chunksize

From
Greg Smith
Date:
On Wed, 1 Apr 2009, Scott Carey wrote:

> Write caching on SATA is totally fine.  There were some old ATA drives that
> when paried with some file systems or OS's would not be safe.  There are
> some combinations that have unsafe write barriers.  But there is a standard
> well supported ATA command to sync and only return after the data is on
> disk.  If you are running an OS that is anything recent at all, and any
> disks that are not really old, you're fine.

While I would like to believe this, I don't trust any claims in this area
that don't have matching tests that demonstrate things working as
expected.  And I've never seen this work.

My laptop has a 7200 RPM drive, which means that if fsync is being passed
through to the disk correctly I can only fsync <120 times/second.  Here's
what I get when I run sysbench on it, starting with the default ext3
configuration:

$ uname -a
Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 UTC 2009 i686 GNU/Linux

$ mount
/dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro)

$ sudo hdparm -I /dev/sda | grep FLUSH
        *    Mandatory FLUSH_CACHE
        *    FLUSH_CACHE_EXT

$ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
sysbench v0.4.8:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed:  0 Read, 10000 Write, 10000 Other = 20000 Total
Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (39.176Mb/sec)
  2507.29 Requests/sec executed


OK, that's clearly cached writes where the drive is lying about fsync.
The claim is that since my drive supports both the flush calls, I just
need to turn on barrier support, right?

[Edit /etc/fstab to remount with barriers]

$ mount
/dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1)

[sysbench again]

  2612.74 Requests/sec executed

-----

This is basically how this always works for me:  somebody claims barriers
and/or SATA disks work now, no really this time.  I test, they give
answers that aren't possible if fsync were working properly, and I conclude
turning off the write cache is just as necessary as it always was.  If you
can suggest something wrong with how I'm testing here, I'd love to hear
about it.  I'd like to believe you but I can't seem to produce any
evidence that supports your claims here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Merlin Moncure
Date:
On Wed, Mar 25, 2009 at 12:16 PM, Scott Carey <scott@richrelevance.com> wrote:
> On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote:
>> On Wed, 25 Mar 2009, Mark Kirkwood wrote:
>>> I'm thinking that the raid chunksize may well be the issue.
>>
>> Why?  I'm not saying you're wrong, I just don't see why that parameter
>> jumped out as a likely cause here.
>>
>
> If postgres is random reading or writing at 8k block size, and the raid
> array is set with 4k block size, then every 8k random i/o will create TWO
> disk seeks since it gets split to two disks.   Effectively, iops will be cut
> in half.

I disagree.  The 4k raid chunks are likely to be grouped together on
disk and read sequentially.  This will only give two seeks in special
cases.  Now, if the PostgreSQL block size is _smaller_ than the raid
chunk size,  random writes can get expensive (especially for raid 5)
because the raid chunk has to be fully read in and written back out.
But this is mainly a theoretical problem I think.

I'm going to go out on a limb and say that for block sizes that are
within one or two 'powers of two' of each other, it doesn't matter a
whole lot.  SSDs might be different, because of the 'erase' block
which might be 128k, but I bet this is dealt with in such a fashion
that you wouldn't really notice it when dealing with different block
sizes in pg.

merlin

Re: Raid 10 chunksize

From
James Mansion
Date:
Greg Smith wrote:
> OK, that's clearly cached writes where the drive is lying about fsync.
> The claim is that since my drive supports both the flush calls, I just
> need to turn on barrier support, right?
>
That's a big pointy finger you are aiming at that drive - are you sure
it was sent the flush instruction?  Clearly *something* isn't right.

> This is basically how this always works for me:  somebody claims
> barriers and/or SATA disks work now, no really this time.  I test,
> they give answers that aren't possible if fsync were working properly,
> I conclude turning off the write cache is just as necessary as it
> always was.  If you can suggest something wrong with how I'm testing
> here, I'd love to hear about it.  I'd like to believe you but I can't
> seem to produce any evidence that supports you claims here.
Try similar tests with Solaris and Vista?

(Might have to give the whole disk to ZFS with Solaris to give it
confidence to enable write cache, which might not be easy with a laptop
boot drive: XP and Vista should show the toggle on the drive)

James


Re: Raid 10 chunksize

From
Scott Carey
Date:
On 4/2/09 10:58 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote:

> On Wed, Mar 25, 2009 at 12:16 PM, Scott Carey <scott@richrelevance.com> wrote:
>> On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote:
>>> On Wed, 25 Mar 2009, Mark Kirkwood wrote:
>>>> I'm thinking that the raid chunksize may well be the issue.
>>>
>>> Why?  I'm not saying you're wrong, I just don't see why that parameter
>>> jumped out as a likely cause here.
>>>
>>
>> If postgres is random reading or writing at 8k block size, and the raid
>> array is set with 4k block size, then every 8k random i/o will create TWO
>> disk seeks since it gets split to two disks.   Effectively, iops will be cut
>> in half.
>
> I disagree.  The 4k raid chunks are likely to be grouped together on
> disk and read sequentially.  This will only give two seeks in special
> cases.

By definition, adjacent raid blocks in a stripe are on different disks.


> Now, if the PostgreSQL block size is _smaller_ than the raid
> chunk size,  random writes can get expensive (especially for raid 5)
> because the raid chunk has to be fully read in and written back out.
> But this is mainly a theoretical problem I think.

This is false and a RAID-5 myth.  New parity can be constructed from the old
parity + the change in data.  Only 2 blocks have to be accessed, not the
whole stripe.

Plus, this was about RAID 10 or 0 where parity does not apply.

>
> I'm going to go out on a limb and say that for block sizes that are
> within one or two 'powers of two' of each other, it doesn't matter a
> whole lot.  SSDs might be different, because of the 'erase' block
> which might be 128k, but I bet this is dealt with in such a fashion
> that you wouldn't really notice it when dealing with different block
> sizes in pg.

Well, raid block size can be significantly larger than postgres or file
system block size and the performance of random reads / writes won't get
worse with larger block sizes.  This holds only for RAID 0 (or 10); parity
is the ONLY thing that makes larger block sizes bad, since there is a
read-modify-write type operation on something the size of one block.

Raid block sizes smaller than the postgres block are always bad and
multiply random i/o.

Read a 8k postgres block in a 8MB md raid 0 block, and you read 8k from one
disk.
Read a 8k postgres block on a md raid 0 with 4k blocks, and you read 4k from
two disks.
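
To make the arithmetic concrete, here is a small sketch (my own
illustration, using an arbitrary 4-disk RAID 0 layout and an arbitrary
8k-aligned offset) that counts how many member disks a single 8k Postgres
block read touches at various chunk sizes:

#include <stdio.h>

#define PG_BLOCK 8192L
#define NDISKS   4L

/* number of member disks touched by one PG_BLOCK read starting at 'offset' */
static long disks_touched(long chunk, long offset)
{
    long first_chunk = offset / chunk;
    long last_chunk  = (offset + PG_BLOCK - 1) / chunk;
    long nchunks     = last_chunk - first_chunk + 1;

    /* in RAID 0, consecutive chunks sit on consecutive disks */
    return (nchunks > NDISKS) ? NDISKS : nchunks;
}

int main(void)
{
    long chunks[] = { 4096, 8192, 65536, 262144, 8L * 1024 * 1024 };
    long offset   = 5 * PG_BLOCK;   /* an arbitrary 8k-aligned block in the array */
    int  i;

    for (i = 0; i < 5; i++)
        printf("chunk %8ld bytes: 8k read touches %ld disk(s)\n",
               chunks[i], disks_touched(chunks[i], offset));
    return 0;
}

With a 4k chunk every aligned 8k read straddles two chunks (and so two
spindles); with 8k or larger chunks it normally stays on one.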


Re: Raid 10 chunksize

From
Merlin Moncure
Date:
On Thu, Apr 2, 2009 at 4:20 PM, Scott Carey <scott@richrelevance.com> wrote:
>
> On 4/2/09 10:58 AM, "Merlin Moncure" <mmoncure@gmail.com> wrote:
>
>> On Wed, Mar 25, 2009 at 12:16 PM, Scott Carey <scott@richrelevance.com> wrote:
>>> On 3/25/09 1:07 AM, "Greg Smith" <gsmith@gregsmith.com> wrote:
>>>> On Wed, 25 Mar 2009, Mark Kirkwood wrote:
>>>>> I'm thinking that the raid chunksize may well be the issue.
>>>>
>>>> Why?  I'm not saying you're wrong, I just don't see why that parameter
>>>> jumped out as a likely cause here.
>>>>
>>>
>>> If postgres is random reading or writing at 8k block size, and the raid
>>> array is set with 4k block size, then every 8k random i/o will create TWO
>>> disk seeks since it gets split to two disks.   Effectively, iops will be cut
>>> in half.
>>
>> I disagree.  The 4k raid chunks are likely to be grouped together on
>> disk and read sequentially.  This will only give two seeks in special
>> cases.
>
> By definition, adjacent raid blocks in a stripe are on different disks.
>
>
>> Now, if the PostgreSQL block size is _smaller_ than the raid
>> chunk size,  random writes can get expensive (especially for raid 5)
>> because the raid chunk has to be fully read in and written back out.
>> But this is mainly a theoretical problem I think.
>
> This is false and a RAID-5 myth.  New parity can be constructed from the old
> parity + the change in data.  Only 2 blocks have to be accessed, not the
> whole stripe.
>
> Plus, this was about RAID 10 or 0 where parity does not apply.
>
>>
>> I'm going to go out on a limb and say that for block sizes that are
>> within one or two 'powers of two' of each other, it doesn't matter a
>> whole lot.  SSDs might be different, because of the 'erase' block
>> which might be 128k, but I bet this is dealt with in such a fashion
>> that you wouldn't really notice it when dealing with different block
>> sizes in pg.
>
> Well, raid block size can be significantly larger than postgres or file
> system block size and the performance of random reads / writes won't get
> worse with larger block sizes.  This holds only for RAID 0 (or 10), parity
> is the ONLY thing that makes larger block sizes bad since there is a
> read-modify-write type operation on something the size of one block.
>
> Raid block sizes smaller than the postgres block is always bad and
> multiplies random i/o.
>
> Read a 8k postgres block in a 8MB md raid 0 block, and you read 8k from one
> disk.
> Read a 8k postgres block on a md raid 0 with 4k blocks, and you read 4k from
> two disks.

yep...that's good analysis...thinko on my part.

merlin

Re: Raid 10 chunksize

From
Scott Carey
Date:
On 4/2/09 1:53 AM, "Greg Smith" <gsmith@gregsmith.com> wrote:

> On Wed, 1 Apr 2009, Scott Carey wrote:
>
>> Write caching on SATA is totally fine.  There were some old ATA drives that
>> when paried with some file systems or OS's would not be safe.  There are
>> some combinations that have unsafe write barriers.  But there is a standard
>> well supported ATA command to sync and only return after the data is on
>> disk.  If you are running an OS that is anything recent at all, and any
>> disks that are not really old, you're fine.
>
> While I would like to believe this, I don't trust any claims in this area
> that don't have matching tests that demonstrate things working as
> expected.  And I've never seen this work.
>
> My laptop has a 7200 RPM drive, which means that if fsync is being passed
> through to the disk correctly I can only fsync <120 times/second.  Here's
> what I get when I run sysbench on it, starting with the default ext3
> configuration:
>
> $ uname -a
> Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 UTC
> 2009 i686 GNU/Linux
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro)
>
> $ sudo hdparm -I /dev/sda | grep FLUSH
>            *    Mandatory FLUSH_CACHE
>            *    FLUSH_CACHE_EXT
>
> $ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1
> --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
> sysbench v0.4.8:  multi-threaded system evaluation benchmark
>
> Running the test with following options:
> Number of threads: 1
>
> Extra file open flags: 0
> 1 files, 16Kb each
> 16Kb total file size
> Block size 16Kb
> Number of random requests for random IO: 10000
> Read/Write ratio for combined random IO test: 1.50
> Periodic FSYNC enabled, calling fsync() each 1 requests.
> Calling fsync() at the end of test, Enabled.
> Using synchronous I/O mode
> Doing random write test
> Threads started!
> Done.
>
> Operations performed:  0 Read, 10000 Write, 10000 Other = 20000 Total
> Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (39.176Mb/sec)
>   2507.29 Requests/sec executed
>
>
> OK, that's clearly cached writes where the drive is lying about fsync.
> The claim is that since my drive supports both the flush calls, I just
> need to turn on barrier support, right?
>
> [Edit /etc/fstab to remount with barriers]
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1)
>
> [sysbench again]
>
>   2612.74 Requests/sec executed
>
> -----
>
> This is basically how this always works for me:  somebody claims barriers
> and/or SATA disks work now, no really this time.  I test, they give
> answers that aren't possible if fsync were working properly, I conclude
> turning off the write cache is just as necessary as it always was.  If you
> can suggest something wrong with how I'm testing here, I'd love to hear
> about it.  I'd like to believe you but I can't seem to produce any
> evidence that supports you claims here.

Your data looks good, and puts a lot of doubt on my previous sources of
info.
So I did more research; it seems that (most) drives don't lie, but your OS and
file system do (or sometimes the drive drivers or raid card).  I know LVM and MD
and other Linux block remapping layer things break write barriers as well.
Apparently ext3 doesn't implement fsync with a write barrier or cache flush.
Linux kernel mailing lists implied that 2.6 had fixed these, but apparently
not.  Write barriers were fixed, but not fsync.  Even more confusing, it
looks like some linux versions that are highly patched and backported
(SUSE, RedHat, mostly) may behave differently from those closer to the
kernel trunk, like Ubuntu.

If you can, try xfs with write barriers on.  I'll try some tests using FIO
(not familiar with sysbench but looks easy too) with various file systems
and some SATA and SAS/SCSI setups when I get a chance.

A lot of my prior evidence came from the linux kernel list and other places
where I trusted the info over the years.  I'll dig up more. But here is what
I've learned in the past plus a bit from today:
Drives don't lie anymore, and write barrier and lower level ATA commands
just work.  Linux fixed write barrier support in kernel 2.5.
Several OS's do things right and many don't with respect to fsync.  I had
thought linux did fix this but it turns out they only fixed write barriers
and left fsync broken:
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024/thread

In your tests the barriers slowed things down a lot, so something is working
right there.  From what I can see, with ext3, metadata changes cause much
more frequent write barrier activity, so 'relatime' and 'noatime' actually
HURT your data integrity as a side effect of fsync not guaranteeing what you
think it does.

The big one, is this quote from the linux kernel list:
" Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync().  Considering how much Linux
is used for critical databases, using these functions, this amazes me.
"

Check this full post out that started that thread:
http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/987024


I admit that it looks like I'm pretty wrong for Linux with ext3, at the
least.
Linux is often not safe with disk write caches because its fsync() call
doesn't flush the cache.  The root problem is not the drives, it's linux /
ext3.  Its write-barrier support is fine now (if you don't go through LVM or
MD, which don't support it), but fsync does not guarantee anything other than
the write having left the OS and gone to the device. In fact POSIX fsync(2)
doesn't require that the data be on disk.  Interestingly, postgres would be
safer on linux if it used sync_file_range instead of fsync(), but that has
other drawbacks and limitations -- and is broken by use of LVM or MD.
Currently, with linux + ext3 + postgres, you are only guaranteed when fsync()
returns that the data has left the OS, not that it is on a drive -- SATA or
SAS.  Strangely, sync_file_range() is safer than fsync() in the presence of
any drive cache at all (including battery backed raid card failure) because
it at least enforces write barriers.
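
For anyone curious what that call looks like, here is a minimal
sync_file_range() sketch (Linux only; the file name is arbitrary, and
whether this is actually any safer than fsync() on a given stack is exactly
what is being debated here, so treat it purely as illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    int fd;

    memset(buf, 'x', sizeof buf);
    fd = open("wal_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (pwrite(fd, buf, sizeof buf, 0) != (ssize_t) sizeof buf) { perror("pwrite"); return 1; }

    /* start writeback of just this 8k range and wait for it to complete */
    if (sync_file_range(fd, 0, sizeof buf,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        perror("sync_file_range");

    close(fd);
    return 0;
}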

Fsync + SATA write cache is safe on Solaris with ZFS, but not Solaris with
UFS (the file system is write barrier and cache aware for the former and not
the latter).

Linux (a lot) and Postgres (a little) can learn from some of the ZFS
concepts with regard to atomicity of changes and checksums on data and
metadata.  Many of the above issues would simply not exist in the presence
of good checksum use.  Ext4 has journal segment checksums, but no metadata
or data checksums exist to detect partial writes to anything but
the journal.  Postgres is adding checksums on data, and is already
essentially copy-on-write for MVCC, which is awesome -- are xlog writes
protected by checksums?  Accidental out-of-order writes become an issue that
can be dealt with in a log or journal that has checksums, even in the
presence of OSes and file systems that don't have good guarantees for fsync,
like Linux + ext3.  Postgres could make itself safe even if drive write
cache is enabled, fsync lies, AND there is a power failure.  If I'm not
mistaken, block checksums on data + xlog entry checksums can make it very
difficult to corrupt even if fsync is off (though data writes happening
before xlog writes are still bad -- that would require external-to-block
checksums -- like zfs -- to fix)!


http://lkml.org/lkml/2005/5/15/85

Where the "disks lie to you" stuff probably came from:
http://hardware.slashdot.org/article.pl?sid=05/05/13/0529252&tid=198&tid=128
(turns out it's the OS that isn't flushing the cache on fsync).

http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_with_the_write_cache_on_journaled_filesystems.3F
So if xfs fsync has a barrier, it's safe with either:
Raw device that respects cache flush + write caching on.
OR
Battery backed raid card + drive write caching off.

Xfs fsync supposedly works right (need to test) but fdatasync() does not.


What this really boils down to is that POSIX fsync does not provide a
guarantee that the data is on disk at all.  My previous comments are wrong.
This means that fsync protects you from OS crashes, but not power failure.
It can do better in some systems / implementations.






Re: Raid 10 chunksize

From
Scott Carey
Date:
On 4/2/09 1:20 PM, "Scott Carey" <scott@richrelevance.com> wrote:
>
> Well, raid block size can be significantly larger than postgres or file
> system block size and the performance of random reads / writes won't get
> worse with larger block sizes.  This holds only for RAID 0 (or 10), parity
> is the ONLY thing that makes larger block sizes bad since there is a
> read-modify-write type operation on something the size of one block.
>
> Raid block sizes smaller than the postgres block is always bad and
> multiplies random i/o.
>
> Read a 8k postgres block in a 8MB md raid 0 block, and you read 8k from one
> disk.
> Read a 8k postgres block on a md raid 0 with 4k blocks, and you read 4k from
> two disks.
>


OK, one more thing.  The 8k read in an 8MB block size raid array can generate
two reads in the following cases:

Your read is on the boundary of the blocks AND

1: your partition is not aligned with the raid blocks.  This can happen if
you partition _inside_ the raid but not if you raid inside the partition
(the latter only being applicable to software raid).
OR
2:  your file system block size is smaller than the postgres block size and
the file block offset is not postgres block aligned.

The likelihood of the first condition is proportional to:

(Postgres block size)/(raid block size)

Hence, for almost all setups with software raid, a larger block size, up to
the point where the above ratio gets sufficiently small, is optimal.  If the
block size gets too large, then random access is more and more likely to
bias towards one drive over the others and lower throughput.

Obviously, in the extreme case where the block size is the disk size, you
would have to randomly access 100% of all the data to get full speed.


Re: Raid 10 chunksize

From
Ron Mayer
Date:
Greg Smith wrote:
> On Wed, 1 Apr 2009, Scott Carey wrote:
>
>> Write caching on SATA is totally fine.  There were some old ATA drives
>> that when paried with some file systems or OS's would not be safe.  There are
>> some combinations that have unsafe write barriers.  But there is a
>> standard
>> well supported ATA command to sync and only return after the data is on
>> disk.  If you are running an OS that is anything recent at all, and any
>> disks that are not really old, you're fine.
>
> While I would like to believe this, I don't trust any claims in this
> area that don't have matching tests that demonstrate things working as
> expected.  And I've never seen this work.
>
> My laptop has a 7200 RPM drive, which means that if fsync is being
> passed through to the disk correctly I can only fsync <120
> times/second.  Here's what I get when I run sysbench on it, starting
> with the default ext3 configuration:

I believe it's ext3 who's cheating in this scenario.

Any chance you can test the program I posted here that
tweaks the inode before the fsync:
http://archives.postgresql.org//pgsql-general/2009-03/msg00703.php

On my system with the fchmod's in that program I was getting one
fsync per disk revolution.   Without the fchmod's, fsync() didn't
wait at all.

This was the case on dozens of drives I tried, dating back to
old PATA drives from 2000.  Only drives from last century didn't
behave that way - but I can't accuse them of lying because
hdparm showed that they didn't claim to support FLUSH_CACHE.


I think this program shows that practically all hard drives are
physically capable of doing a proper fsync; but annoyingly
ext3 refuses to send the FLUSH_CACHE commands to the drive
unless the inode changed.
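
A minimal sketch in the spirit of that program (the real one is in the
archive post linked above; the file name, loop count and alternating modes
here are my own arbitrary choices). It dirties the inode with fchmod()
before each fsync() so that ext3 has to commit the journal, then reports
how many such fsyncs complete per second:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const int iterations = 200;
    char buf[8192];
    struct timespec t0, t1;
    double secs;
    int i, fd;

    memset(buf, 'x', sizeof buf);
    fd = open("fchmod_fsync.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++) {
        pwrite(fd, buf, sizeof buf, 0);
        fchmod(fd, (i & 1) ? 0644 : 0664);   /* alternate so the inode really changes */
        fsync(fd);                           /* ext3 should now flush, barrier included */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d fsync()s in %.2f s = %.1f per second\n",
           iterations, secs, iterations / secs);
    close(fd);
    return 0;
}

On a single 7200 RPM drive that should top out at around 120 per second
(one per revolution); drop the fchmod() calls on ext3 and the same loop
reports thousands, which is the behaviour described above.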


> $ uname -a
> Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52
> UTC 2009 i686 GNU/Linux
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro)
>
> $ sudo hdparm -I /dev/sda | grep FLUSH
>        *    Mandatory FLUSH_CACHE
>        *    FLUSH_CACHE_EXT
>
> $ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1
> --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
> sysbench v0.4.8:  multi-threaded system evaluation benchmark
>
> Running the test with following options:
> Number of threads: 1
>
> Extra file open flags: 0
> 1 files, 16Kb each
> 16Kb total file size
> Block size 16Kb
> Number of random requests for random IO: 10000
> Read/Write ratio for combined random IO test: 1.50
> Periodic FSYNC enabled, calling fsync() each 1 requests.
> Calling fsync() at the end of test, Enabled.
> Using synchronous I/O mode
> Doing random write test
> Threads started!
> Done.
>
> Operations performed:  0 Read, 10000 Write, 10000 Other = 20000 Total
> Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (39.176Mb/sec)
>  2507.29 Requests/sec executed
>
>
> OK, that's clearly cached writes where the drive is lying about fsync.
> The claim is that since my drive supports both the flush calls, I just
> need to turn on barrier support, right?
>
> [Edit /etc/fstab to remount with barriers]
>
> $ mount
> /dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1)
>
> [sysbench again]
>
>  2612.74 Requests/sec executed
>
> -----
>
> This is basically how this always works for me:  somebody claims
> barriers and/or SATA disks work now, no really this time.  I test, they
> give answers that aren't possible if fsync were working properly, I
> conclude turning off the write cache is just as necessary as it always
> was.  If you can suggest something wrong with how I'm testing here, I'd
> love to hear about it.  I'd like to believe you but I can't seem to
> produce any evidence that supports your claims here.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
>


Re: Raid 10 chunksize

From
Hannes Dorbath
Date:
Ron Mayer wrote:
> Greg Smith wrote:
>> On Wed, 1 Apr 2009, Scott Carey wrote:
>>
>>> Write caching on SATA is totally fine.  There were some old ATA drives
>>> that when paired with some file systems or OS's would not be safe.  There are
>>> some combinations that have unsafe write barriers.  But there is a
>>> standard
>>> well supported ATA command to sync and only return after the data is on
>>> disk.  If you are running an OS that is anything recent at all, and any
>>> disks that are not really old, you're fine.
>> While I would like to believe this, I don't trust any claims in this
>> area that don't have matching tests that demonstrate things working as
>> expected.  And I've never seen this work.
>>
>> My laptop has a 7200 RPM drive, which means that if fsync is being
>> passed through to the disk correctly I can only fsync <120
>> times/second.  Here's what I get when I run sysbench on it, starting
>> with the default ext3 configuration:
>
> I believe it's ext3 who's cheating in this scenario.

I assume so too. Here is the same test using XFS, first with barriers (the
XFS default) and then without:

Linux 2.6.28-gentoo-r2 #1 SMP Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz
GenuineIntel GNU/Linux

/dev/sdb /data2 xfs rw,noatime,attr2,logbufs=8,logbsize=256k,noquota 0 0

# sysbench --test=fileio --file-fsync-freq=1 --file-num=1
--file-total-size=16384 --file-test-mode=rndwr run
sysbench 0.4.10:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed:  0 Read, 10000 Write, 10000 Other = 20000 Total
Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (463.9Kb/sec)
    28.99 Requests/sec executed

Test execution summary:
     total time:                          344.9013s
     total number of events:              10000
     total time taken by event execution: 0.1453
     per-request statistics:
          min:                                  0.01ms
          avg:                                  0.01ms
          max:                                  0.07ms
          approx.  95 percentile:               0.01ms

Threads fairness:
     events (avg/stddev):           10000.0000/0.00
     execution time (avg/stddev):   0.1453/0.00


And now without barriers:

/dev/sdb /data2 xfs
rw,noatime,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0

# sysbench --test=fileio --file-fsync-freq=1 --file-num=1
--file-total-size=16384 --file-test-mode=rndwr run
sysbench 0.4.10:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed:  0 Read, 10000 Write, 10000 Other = 20000 Total
Read 0b  Written 156.25Mb  Total transferred 156.25Mb  (62.872Mb/sec)
  4023.81 Requests/sec executed

Test execution summary:
     total time:                          2.4852s
     total number of events:              10000
     total time taken by event execution: 0.1325
     per-request statistics:
          min:                                  0.01ms
          avg:                                  0.01ms
          max:                                  0.06ms
          approx.  95 percentile:               0.01ms

Threads fairness:
     events (avg/stddev):           10000.0000/0.00
     execution time (avg/stddev):   0.1325/0.00


--
Best regards,
Hannes Dorbath

Re: Raid 10 chunksize

From
Mark Kirkwood
Date:
Mark Kirkwood wrote:
> Rebuilt with 256K chunksize:
>
> transaction type: TPC-B (sort of)
> scaling factor: 100
> number of clients: 24
> number of transactions per client: 12000
> number of transactions actually processed: 288000/288000
> tps = 942.852104 (including connections establishing)
> tps = 943.019223 (excluding connections establishing)
>

Increasing checkpoint_segments to 96 and decreasing
bgwriter_lru_maxpages to 100:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 1219.221721 (including connections establishing)
tps = 1219.501150 (excluding connections establishing)

... as suggested by Greg (actually he suggested reducing
bgwriter_lru_maxpages to 0, but that seemed no better). Anyway, we're
seeing quite a reasonable improvement (about 83% from where we started).
It will be interesting to see how/if the improvements measured in
pgbench translate into the "real" application. Thanks for all your help
(particularly to both Scotts, Greg and Stef).
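
For reference, the final combination boils down to roughly the following
(the mdadm device names and partitions here are illustrative only, not
copied from the actual box):

# md raid 10 with a 256K chunk for the database filesystem
mdadm --create /dev/md0 --level=10 --chunk=256 --raid-devices=4 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# postgresql.conf
checkpoint_segments   = 96
bgwriter_lru_maxpages = 100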

regards

Mark

Re: Raid 10 chunksize

From
Greg Smith
Date:
Hannes sent this off-list, presumably via newsgroup, and it's certainly
worth sharing.  I've always been scared off of using XFS because of the
problems outlined at http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc ,
with more testing showing similar issues at
http://pages.cs.wisc.edu/~vshree/xfs.pdf too

(I'm finding that old message with Ted saying "Making sure you don't lose
data is Job #1" hilarious right now, consider the recent ext4 data loss
debacle)


Re: Raid 10 chunksize

From
Greg Smith
Date:
On Thu, 2 Apr 2009, James Mansion wrote:

> Might have to give the whole disk to ZFS with Solaris to give it
> confidence to enable write cache

Confidence, sure, but not necessarily performance at the same time.  The
ZFS Kool-Aid gets bitter sometimes too, and I worry that its reputation
causes people to just trust it when they should be wary. If there's
anything this thread does, I hope it helps demonstrate how easy it is to
discover reality doesn't match expectations at all in this very messy
area.  Trust No One!  Keep Your Laser Handy!

There's a summary of the expected happy ZFS actions at
http://www.opensolaris.org/jive/thread.jspa?messageID=19264& and a good
cautionary tale of unhappy ZFS behavior in this area at
http://blogs.digitar.com/jjww/2006/12/shenanigans-with-zfs-flushing-and-intelligent-arrays/
and its follow-up
http://blogs.digitar.com/jjww/2007/10/back-in-the-sandbox-zfs-flushing-shenanigans-revisted/

Systems with a hardware write cache are pretty common on this list, which
makes the situation described there not that unlikely to run into.  The
official word here is at

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Greg Smith
Date:
On Thu, 2 Apr 2009, Scott Carey wrote:

> The big one, is this quote from the linux kernel list:
> " Right now, if you want a reliable database on Linux, you _cannot_
> properly depend on fsync() or fdatasync().  Considering how much Linux
> is used for critical databases, using these functions, this amazes me.
> "

Things aren't as bad as that out-of-context quote makes them seem.  There
are two main problem situations here:

1) You cannot trust Linux to flush data to a hard drive's write cache.
Solution:  turn off the write cache.  Given the general poor state of
targeted fsync on Linux (quoting from a downthread comment by David Lang:
"in data=ordered mode, the default for most distros, ext3 can end up
having to write all pending data when you do a fsync on one file"), those
fsyncs were likely to blow out the drive cache anyway.
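
On Linux that's typically done with hdparm (the device name below is just
an example):

hdparm -W 0 /dev/sda    # turn off the drive's volatile write cache
hdparm -W /dev/sda      # query the current write-cache setting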

2) There are no hard guarantees about write ordering at the disk level; if
you write blocks ABC and then fsync, you might actually get, say, only B
written before power goes out.  I don't believe the PostgreSQL WAL design
will be corrupted by this particular situation, because until that fsync
comes back saying all 3 are done none of them are relied upon.

> Interestingly, postgres would be safer on linux if it used
> sync_file_range instead of fsync() but that has other drawbacks and
> limitations

I have thought about whether it would be possible to add a Linux-specific
improvement here into the code path that does something custom in this
area for Windows/Mac OS X when you use wal_sync_method=fsync_writethrough.
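
For anyone unfamiliar with the call being discussed, a rough sketch of how
it looks follows (this is not PostgreSQL code, just an illustration; note
that it does not flush file metadata or the drive's write cache, which is
part of the limitations mentioned above):

#define _GNU_SOURCE
#include <fcntl.h>

/* Write out the dirty pages of one file range (offset 0, nbytes 0 =
 * through end of file) and wait for the writeback to complete.  Unlike
 * ext3 fsync() in data=ordered mode, this only touches this one file. */
int flush_range(int fd)
{
    return sync_file_range(fd, 0, 0,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}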

We really should update the documentation in this area before 8.4 ships.
I'm looking into moving the "Tuning PostgreSQL WAL Synchronization" paper
I wrote onto the wiki and then fleshing it out with all this
filesystem-specific trivia.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
david@lang.hm
Date:
On Fri, 3 Apr 2009, Greg Smith wrote:

> Hannes sent this off-list, presumably via newsgroup, and it's certainly worth
> sharing.  I've always been scared off of using XFS because of the problems
> outlined at http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc , with more
> testing showing similar issues at http://pages.cs.wisc.edu/~vshree/xfs.pdf
> too
>
> (I'm finding that old message with Ted saying "Making sure you don't lose
> data is Job #1" hilarious right now, consider the recent ext4 data loss
> debacle)

also note that the message from Ted was back in 2004, there has been a
_lot_ of work done on XFS in the last 4 years.

as for the second link, that focuses on what happens to the filesystem if
the disk under it starts returning errors or garbage. with the _possible_
exception of ZFS, every filesystem around will do strange things under
those conditions. and in my opinion, the way to deal with this sort of
thing isn't to move to ZFS to detect the problem, it's to set up redundancy
in your storage so that you can not only detect the problem, but correct
it as well (it's a good thing to know that your database file is corrupt,
but that's not nearly as useful as having some way to recover the data
that was there)

David Lang


Re: Raid 10 chunksize

From
Greg Smith
Date:
On Fri, 3 Apr 2009, david@lang.hm wrote:

> also note that the message from Ted was back in 2004, there has been a _lot_
> of work done on XFS in the last 4 years.

Sure, I know they've made progress, which is why I didn't also bring up
older ugly problems like delayed allocation issues reducing files to zero
length on XFS.  I thought that particular issue was pretty fundamental to
the logical journal scheme XFS is based on.  What you'll get out of disk
I/O at smaller than the block level is pretty unpredictable when there's a
failure.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Raid 10 chunksize

From
Scott Carey
Date:

On 4/3/09 6:05 PM, "david@lang.hm" <david@lang.hm> wrote:

> On Fri, 3 Apr 2009, Greg Smith wrote:
>
>> Hannes sent this off-list, presumably via newsgroup, and it's certainly worth
>> sharing.  I've always been scared off of using XFS because of the problems
>> outlined at http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc , with more
>> testing showing similar issues at http://pages.cs.wisc.edu/~vshree/xfs.pdf
>> too
>>
>> (I'm finding that old message with Ted saying "Making sure you don't lose
>> data is Job #1" hilarious right now, consider the recent ext4 data loss
>> debacle)
>
> also note that the message from Ted was back in 2004, there has been a
> _lot_ of work done on XFS in the last 4 years.
>
> as for the second link, that focuses on what happens to the filesystem if
> the disk under it starts returning errors or garbage. with the _possible_
> exception of ZFS, every filesystem around will do strange things under
> those conditions. and in my opinion, the way to deal with this sort of
> thing isn't to move to ZFS to detect the problem, it's to set up redundancy
> in your storage so that you can not only detect the problem, but correct
> it as well (it's a good thing to know that your database file is corrupt,
> but that's not nearly as useful as having some way to recover the data
> that was there)

Not trying to spread too much kool-aid around, but ZFS does that.

If a mirror set (which might be 2, 3 or more copies in the mirror) detects a
checksum error, it reads the other copies and attempts to correct the bad
block.
PLUS, the performance under normal conditions for reads scales with the
mirrors.  12 disks in raid 10 do writes as fast as 6 disk raid 0, but reads
as fast as 12 disk raid 0 since it does not have to read both mirror sets to
detect an error, only to recover.  You can even just write zeros to random
spots in a mirror and it will throw errors and use the other copies.

This really isn't a ZFS promotion; rather it's a promotion of the power of
checksums at the file system and raid level.  A hardware raid card could
just as well sacrifice some space to place checksums on its blocks and get
much the same result.
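
As a concrete illustration of the self-healing behaviour described above
(pool and device names here are made up, not from any box in this thread):

# three-way ZFS mirror; a scrub makes ZFS read every block, verify the
# checksums and rewrite any bad copy from a good one
zpool create tank mirror c1t0d0 c1t1d0 c1t2d0
zpool scrub tank
zpool status -v tank    # reports checksum errors found and repaired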

