Thread: new server I/O setup

new server I/O setup

From: "Fernando Hevia"
Hi all,
 
I've just received this new server:
1 x XEON 5520 Quad Core w/ HT
8 GB RAM 1066 MHz
16 x SATA II Seagate Barracuda 7200.12
3ware 9650SE w/ 256MB BBU
 
It will run Ubuntu 8.04 LTS as a dedicated Postgres 8.4 server. The database will receive between 100 and 1000 inserts per second (call detail records of ~300 bytes each) from around 20 clients (VoIP gateways). Other activity is mostly read-only, plus some non-time-critical writes, generally at off-peak hours.
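
A rough back-of-envelope check of the write volume (taking the worst case
of 1000 records/s at ~300 bytes each):

   1000 inserts/s x 300 bytes = ~300 KB/s of raw record data

Even with WAL and index overhead on top, that is far below the sequential
write rate of a single BBU-backed mirror.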
 
So my first choice was:
 
2 discs in RAID 1 for OS + pg_xlog partitioned with ext2.
12 discs in RAID 10 for postgres data, sole partition with ext3.
2 spares
 
 
My second choice is:
 
4 discs in RAID 10 for OS + pg_xlog partitioned with ext2
10 discs in RAID 10 for postgres, ext3
2 spares.
 
The BBU cache will be enabled for both RAID volumes.
 
I justified my first choice on the grounds that WAL writes are sequential, and OS writes pretty much are too, so a RAID 1 should hold its ground against a 12-disc RAID 10 handling random writes.
 
I don't know if I will manage to find enough time to try out both setups, so I wanted to know what you guys think of these two alternatives. Do you think a single RAID 1 will become a bottleneck? Feel free to suggest a better setup I haven't considered; it would be most welcome.
 
PS: any clue whether hdparm works to deactivate the disks' write cache even when they are behind the 3ware controller?
 
Regards,
Fernando.
 

Re: new server I/O setup

From: Scott Marlowe
On Thu, Jan 14, 2010 at 1:03 PM, Fernando Hevia <fhevia@ip-tel.com.ar> wrote:
> Hi all,
>
> I've just received this new server:
> 1 x XEON 5520 Quad Core w/ HT
> 8 GB RAM 1066 MHz
> 16 x SATA II Seagate Barracuda 7200.12
> 3ware 9650SE w/ 256MB BBU
>
> It will run Ubuntu 8.04 LTS as a dedicated Postgres 8.4 server. The
> database will receive between 100 and 1000 inserts per second (call
> detail records of ~300 bytes each) from around 20 clients (VoIP
> gateways). Other activity is mostly read-only, plus some
> non-time-critical writes, generally at off-peak hours.
>
> So my first choice was:
>
> 2 discs in RAID 1 for OS + pg_xlog partitioned with ext2.
> 12 discs in RAID 10 for postgres data, sole partition with ext3.
> 2 spares
>
>
> My second choice is:
>
> 4 discs in RAID 10 for OS + pg_xlog partitioned with ext2
> 10 discs in RAID 10 for postgres, ext3
> 2 spares.
>
> The BBU cache will be enabled for both RAID volumes.
>
> I justified my first choice on the grounds that WAL writes are
> sequential, and OS writes pretty much are too, so a RAID 1 should hold
> its ground against a 12-disc RAID 10 handling random writes.

I think your first choice is right.  I use the same basic setup with
147G 15k5 SAS seagate drives and the pg_xlog / OS partition is almost
never close to the same level of utilization, according to iostat, as
the main 12 disk RAID-10 array is.  We may have to buy a 16 disk array
to keep up with load, and it would be all main data storage, and our
pg_xlog main drive pair would be just fine.
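
If it helps, a minimal sketch of how I watch that (device names are
hypothetical; substitute whatever your 3ware units show up as):

   iostat -x sda sdb 5    # extended per-device stats every 5 seconds
   # compare the %util and await columns between the OS/pg_xlog pair
   # (sda) and the main RAID-10 array (sdb)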

> I don't know if I will manage to find enough time to try out both
> setups, so I wanted to know what you guys think of these two
> alternatives. Do you think a single RAID 1 will become a bottleneck?
> Feel free to suggest a better setup I haven't considered; it would be
> most welcome.

For 12 disks, most likely not.  Especially since your load is mostly
small randomish writes, not a bunch of big multi-megabyte records or
anything, so the random access performance on the 12 disk RAID-10
should be your limiting factor.

> PS: any clue whether hdparm works to deactivate the disks' write cache
> even when they are behind the 3ware controller?

Not sure, but I'm pretty sure the 3ware card already does the right
thing and turns off the write caching.

Re: new server I/O setup

From: Greg Smith
Fernando Hevia wrote:
I justified my first choice on the grounds that WAL writes are sequential, and OS writes pretty much are too, so a RAID 1 should hold its ground against a 12-disc RAID 10 handling random writes.

The problem with this theory is that when PostgreSQL does WAL writes and asks to sync the data, you'll probably discover all of the open OS writes that were sitting in the Linux write cache getting flushed before that happens.  And that could lead to horrible performance--good luck if the database tries to do something after cron kicks off updatedb each night for example.
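
If you end up sharing the OS and pg_xlog spindles anyway, one partial
mitigation is to shrink how much dirty data the Linux kernel is allowed
to accumulate. A sketch only; the right values depend on your workload:

   # /etc/sysctl.conf -- start background writeback sooner and cap the
   # dirty-page pile-up so each fsync has less queued behind it
   vm.dirty_background_ratio = 1
   vm.dirty_ratio = 10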

I think there are two viable configurations you should be considering that you haven't thought about, although neither is quite what you're looking at:

2 discs in RAID 1 for OS
2 discs in RAID 1 for pg_xlog
10 discs in RAID 10 for postgres, ext3
2 spares.

14 discs in RAID 10 for everything
2 spares.

Impossible to say which of the four possibilities here will work out better.  I tend to lean toward the first one I listed above because it makes it very easy to monitor the pg_xlog activity (and the non-database activity) separately from everything else, and having no other writes going on makes it very unlikely that the pg_xlog will ever become a bottleneck.  But if you've got 14 disks in there, it's unlikely to be a bottleneck anyway.  The second config above will get you slightly better random I/O though, so for workloads that are really limited on that there's a good reason to prefer it.

Also:  the whole "use ext2 for the pg_xlog" idea is overrated as far as I'm concerned.  I start with ext3, and only if I get evidence that the drive is a bottleneck do I ever think of reverting to unjournaled writes just to get a little speed boost.  In practice I suspect you'll see no benchmark difference, and will instead curse the decision the first time your server is restarted badly and it gets stuck at fsck.
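
For reference, a hypothetical /etc/fstab line for an ext3 data partition
(device and mount point are placeholders; noatime just avoids pointless
metadata writes on reads):

   /dev/sdb1  /var/lib/postgresql  ext3  noatime  0  2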

PS: any clue whether hdparm works to deactivate the disks' write cache even when they are behind the 3ware controller?

You don't use hdparm for that sort of thing; you need to use 3ware's tw_cli utility.  I believe that the individual drive caches are always disabled, but whether the controller cache is turned on or not depends on whether the card has a battery.  The behavior here is kind of weird though--it changes if you're in RAID mode vs. JBOD mode, so be careful to look at what all the settings are.  Some of these 3ware cards default to extremely aggressive background scanning for bad blocks as well; you might have to tweak that downward.
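
A few tw_cli invocations to start with (from memory, so double-check the
exact syntax against the 9650SE documentation):

   tw_cli /c0 show             # controller, unit, and drive summary
   tw_cli /c0/u0 show all      # per-unit details, including cache state
   tw_cli /c0/bbu show all     # battery presence and status
   tw_cli /c0/u0 set cache=on  # enable the controller write cache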

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: new server I/O setup

From: Matthew Wakeling
On Thu, 14 Jan 2010, Scott Marlowe wrote:
>> I've just received this new server:
>> 1 x XEON 5520 Quad Core w/ HT
>> 8 GB RAM 1066 MHz
>> 16 x SATA II Seagate Barracuda 7200.12
>> 3ware 9650SE w/ 256MB BBU
>>
>> 2 discs in RAID 1 for OS + pg_xlog partitioned with ext2.
>> 12 discs in RAID 10 for postgres data, sole partition with ext3.
>> 2 spares
>
> I think your first choice is right.  I use the same basic setup with
> 147G 15k5 SAS seagate drives and the pg_xlog / OS partition is almost
> never close to the same level of utilization, according to iostat, as
> the main 12 disk RAID-10 array is.  We may have to buy a 16 disk array
> to keep up with load, and it would be all main data storage, and our
> pg_xlog main drive pair would be just fine.

The benefits of splitting off a couple of discs for WAL are dubious given
the BBU cache, since the cache will convert the frequent fsyncs to
sequential writes anyway. My advice would be to test the difference. If
the bottleneck is random writes on the 12-disc array, then it may actually
help more to improve that to a 14-disc array instead.
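
A minimal comparison could be as simple as a pgbench run against each
layout (the scale factor and duration here are only placeholders):

   pgbench -i -s 100 testdb     # initialise ~1.5 GB of test data
   pgbench -c 10 -T 300 testdb  # 10 clients for 5 minutes, per layout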

I'd also question whether you need two hot spares, with RAID-10. Obviously
that's a judgement call only you can make, but you could consider whether
it is sufficient to just have a spare disc sitting on a shelf next to the
server rather than using up a slot in the server. Depends on how quickly
you can get to the server on failure, and how important the data is.

Matthew

--
 In the beginning was the word, and the word was unsigned,
 and the main() {} was without form and void...

Re: new server I/O setup

From: "Fernando Hevia"

> -----Original Message-----
> From: Scott Marlowe
>
> I think your first choice is right.  I use the same basic
> setup with 147G 15k5 SAS seagate drives and the pg_xlog / OS
> partition is almost never close to the same level of
> utilization, according to iostat, as the main 12 disk RAID-10
> array is.  We may have to buy a 16 disk array to keep up with
> load, and it would be all main data storage, and our pg_xlog
> main drive pair would be just fine.
>


> > Do you think a single RAID 1 will become a bottleneck?
> > Feel free to suggest a better setup I haven't considered; it would be
> > most welcome.
>
> For 12 disks, most likely not.  Especially since your load is
> mostly small randomish writes, not a bunch of big
> multi-megabyte records or anything, so the random access
> performance on the 12 disk RAID-10 should be your limiting factor.
>

Good to know this setup has been tried successfully.
Thanks for the comments.


Re: new server I/O setup

From: "Fernando Hevia"

> -----Original Message-----
> From: Greg Smith
>
>> Fernando Hevia wrote:
>>
>>     I justified my first choice on the grounds that WAL writes are
>> sequential, and OS writes pretty much are too, so a RAID 1 should
>> hold its ground against a 12-disc RAID 10 handling random writes.
>>
>
> The problem with this theory is that when PostgreSQL does WAL
> writes and asks to sync the data, you'll probably discover
> all of the open OS writes that were sitting in the Linux
> write cache getting flushed before that happens.  And that
> could lead to horrible performance--good luck if the database
> tries to do something after cron kicks off updatedb each
> night for example.
>

I actually hadn't considered such a scenario. It probably won't hit us
because our real-time activity diminishes abruptly overnight when
maintenance routines kick in.
But should this prove to be an issue, disabling synchronous_commit should
help out, and thanks to the BBU cache the risk of lost transactions should
be very low. In any case I would leave it on until the issue arises. Do you
agree?

In our business, the worst case would translate to losing a couple of
seconds' worth of call records, all recoverable from secondary storage.
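
For the record, a sketch of the relevant postgresql.conf change, should it
come to that (the setting exists since 8.3):

   # commits return before the WAL flush; a crash can lose roughly the
   # last few hundred milliseconds of acknowledged transactions
   synchronous_commit = off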


> I think there are two viable configurations you should be considering
> that you haven't thought about, although neither is quite what you're
> looking at:
>
> 2 discs in RAID 1 for OS
> 2 discs in RAID 1 for pg_xlog
> 10 discs in RAID 10 for postgres, ext3
> 2 spares.
>
> 14 discs in RAID 10 for everything
> 2 spares.
>
> Impossible to say which of the four possibilities here will
> work out better.  I tend to lean toward the first one I
> listed above because it makes it very easy to monitor the
> pg_xlog activity (and the non-database activity) separately
> from everything else, and having no other writes going on
> makes it very unlikely that the pg_xlog will ever become a
> bottleneck.  But if you've got 14 disks in there, it's
> unlikely to be a bottleneck anyway.  The second config above
> will get you slightly better random I/O though, so for
> workloads that are really limited on that there's a good
> reason to prefer it.
>

Besides the random writes, we have quite intensive random reads too. I
need to maximize throughput on the RAID 10 array, and the thought of
taking 2 more disks from it makes me rather uneasy.
I did consider the 14-disk RAID 10 for everything since it's very
attractive for read I/O. But with 12 spindles, read I/O should be
incredibly fast for us considering our current production server has a
meager 4-disk RAID 10.
I still think the 2-disc RAID 1 + 12-disc RAID 10 will be the best
combination for write throughput, provided the RAID 1 can keep pace with
the RAID 10, something Scott has already confirmed from his experience.

> Also:  the whole "use ext2 for the pg_xlog" idea is overrated as
> far as I'm concerned.  I start with ext3, and only if I get
> evidence that the drive is a bottleneck do I ever think of
> reverting to unjournaled writes just to get a little speed
> boost.  In practice I suspect you'll see no benchmark
> difference, and will instead curse the decision the first
> time your server is restarted badly and it gets stuck at fsck.
>

This advice could be interpreted as "start safe and take risks only if
needed". I think you are right and will follow it.

>>     PS: any clue whether hdparm works to deactivate the disks'
>> write cache even when they are behind the 3ware controller?
>>
>
> You don't use hdparm for that sort of thing; you need to use
> 3ware's tw_cli utility.  I believe that the individual drive
> caches are always disabled, but whether the controller cache
> is turned on or not depends on whether the card has a
> battery.  The behavior here is kind of weird though--it
> changes if you're in RAID mode vs. JBOD mode, so be careful
> to look at what all the settings are.  Some of these 3ware
> cards default to extremely aggressive background scanning for
> bad blocks as well; you might have to tweak that downward.
>

It has a battery and it is working in RAID mode.
It's also my first experience with a hardware controller. I'm installing
tw_cli at this moment.

Greg, I hold your knowledge in this area in very high regard.
Your comments are much appreciated.


Thanks,
Fernando


Re: new server I/O setup

From: "Fernando Hevia"

> -----Original Message-----
> From: Matthew Wakeling [mailto:matthew@flymine.org]
> Sent: Friday, January 15, 2010 08:21
> To: Scott Marlowe
> CC: Fernando Hevia; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] new server I/O setup
>
> On Thu, 14 Jan 2010, Scott Marlowe wrote:
> >> I've just received this new server:
> >> 1 x XEON 5520 Quad Core w/ HT
> >> 8 GB RAM 1066 MHz
> >> 16 x SATA II Seagate Barracuda 7200.12
> >> 3ware 9650SE w/ 256MB BBU
> >>
> >> 2 discs in RAID 1 for OS + pg_xlog partitioned with ext2.
> >> 12 discs in RAID 10 for postgres data, sole partition with ext3.
> >> 2 spares
> >
> > I think your first choice is right.  I use the same basic setup with
> > 147G 15k5 SAS seagate drives and the pg_xlog / OS partition is almost
> > never close to the same level of utilization, according to iostat, as
> > the main 12 disk RAID-10 array is.  We may have to buy a 16 disk array
> > to keep up with load, and it would be all main data storage, and our
> > pg_xlog main drive pair would be just fine.
>
> The benefits of splitting off a couple of discs for WAL are
> dubious given the BBU cache, since the cache will
> convert the frequent fsyncs to sequential writes anyway. My
> advice would be to test the difference. If the bottleneck is
> random writes on the 12-disc array, then it may actually help
> more to improve that to a 14-disc array instead.

I am new to the benefits of a BBU cache and have a lot to experience and
learn. Hopefully I will have the time to test both setups.
I was wondering if disabling the BBU cache on the RAID 1 array would
make any difference. All 256MB would then be available for the random
I/O on the RAID 10.

>
> I'd also question whether you need two hot spares, with
> RAID-10. Obviously that's a judgement call only you can make,
> but you could consider whether it is sufficient to just have
> a spare disc sitting on a shelf next to the server rather
> than using up a slot in the server. Depends on how quickly
> you can get to the server on failure, and how important the data is.
>

This is something I haven't been able to make up my mind about, since it's
very painful to lose those 2 slots.
They could make for the dedicated pg_xlog RAID 1 Greg is suggesting.
Very tempting, but I still think I will start safe for now and see what
happens later.

Thanks for your insight.

Regards,
Fernando.


Re: new server I/O setup

From: Matthew Wakeling
On Fri, 15 Jan 2010, Fernando Hevia wrote:
> I was wondering if disabling the BBU cache on the RAID 1 array would
> make any difference. All 256MB would then be available for the random
> I/O on the RAID 10.

That would be pretty disastrous, to be honest. The benefit of the cache is
not only that it smooths random access, but it also accelerates fsync. The
whole point of the WAL disc is for it to be able to accept lots of fsyncs
very quickly, and it can't do that without its BBU cache.
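
If you want to see the effect directly, timing synced writes on the
volume gives a rough ceiling on commits per second. A sketch; the path
is hypothetical:

   # 1000 WAL-page-sized writes, each forced to stable storage
   time dd if=/dev/zero of=/mnt/pg_xlog/test.dat bs=8k count=1000 oflag=dsync

With the BBU cache enabled this should finish in a second or two; with it
disabled, expect something closer to the raw rotational rate of the disc.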

Matthew

--
 Heat is work, and work's a curse. All the heat in the universe, it's
 going to cool down, because it can't increase, then there'll be no
 more work, and there'll be perfect peace.      -- Michael Flanders

Re: new server I/O setup

From: Pierre Frédéric Caillaud
    No-one has mentioned SSDs yet ?...

Re: new server I/O setup

From: "Fernando Hevia"

> -----Original Message-----
> From: Pierre Frédéric Caillaud
> Sent: Friday, January 15, 2010 15:00
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] new server I/O setup
>
>
>     No-one has mentioned SSDs yet ?...
>

The post is about an already-purchased server just delivered to my office.
I have been following the SSD benchmarking posts with interest, but no
SSDs were bought on this opportunity and we have no budget for them
either, at least not in the foreseeable future.


Re: new server I/O setup

From: Scott Marlowe
2010/1/15 Fernando Hevia <fhevia@ip-tel.com.ar>:
>
>
>> -----Original Message-----
>> From: Pierre Frédéric Caillaud
>> Sent: Friday, January 15, 2010 15:00
>> To: pgsql-performance@postgresql.org
>> Subject: Re: [PERFORM] new server I/O setup
>>
>>
>>       No-one has mentioned SSDs yet ?...
>>
>
> The post is about an already-purchased server just delivered to my office.
> I have been following the SSD benchmarking posts with interest, but no
> SSDs were bought on this opportunity and we have no budget for them
> either, at least not in the foreseeable future.

And no matter how good they look on paper, being one of the first
people to use and in effect test them in production can be very
exciting.  And sometimes excitement isn't what you really want from
your production servers.