Thread: RAID Stripe size
Hi Everyone,

The machine is an IBM x345 with a ServeRAID 6i (128mb cache) and 6 SCSI 15k disks.

2 disks are in RAID1 and hold the OS, SWAP & pg_xlog.
4 disks are in RAID10 and hold the cluster itself.

The DB will have two major tables, one with 10 million rows and one with 100 million rows. All the activity against these tables will be SELECTs.

Currently the stripe size is 8k. I have read in many places that this is a poor setting. Am I right?
bm\mbn wrote:
> Hi Everyone
>
> The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k
> disks.
>
> 2 disks are in RAID1 and hold the OS, SWAP & pg_xlog
> 4 disks are in RAID10 and hold the Cluster itself.
>
> the DB will have two major tables 1 with 10 million rows and one with
> 100 million rows.
> All the activities against this tables will be SELECT.

What type of SELECTs will you be doing? Mostly sequential reads of a bunch of data, or indexed lookups of random pieces?

> Currently the strip size is 8k. I read in many place this is a poor
> setting.

From what I've heard of RAID, if you are doing large sequential transfers, larger stripe sizes (128k, 256k) generally perform better. For postgres, though, when you are writing, having the stripe size be around the same size as your page size (8k) could be advantageous, as when postgres reads a page, it only reads a single stripe. So if it were reading a series of pages, each one would come from a different disk.

I may be wrong about that, though.

John
=:->

> Am i right ?
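John's point about an 8k stripe spreading consecutive pages across disks can be sketched numerically. This is a hypothetical back-of-the-envelope model, not anything from the thread: it assumes stripe units are laid out round-robin across the data spindles (real controllers may differ), and that four disks in RAID10 give two data spindles.

```python
PAGE = 8 * 1024  # PostgreSQL's default 8K page

def disk_for_offset(offset, stripe_size, data_disks):
    """Index of the data disk holding the stripe unit at byte `offset`,
    assuming simple round-robin striping from offset zero."""
    return (offset // stripe_size) % data_disks

# With an 8K stripe, consecutive 8K pages alternate between the two spindles:
print([disk_for_offset(i * PAGE, 8 * 1024, 2) for i in range(4)])   # [0, 1, 0, 1]
# With a 64K stripe, runs of eight consecutive pages share one spindle:
print([disk_for_offset(i * PAGE, 64 * 1024, 2) for i in range(4)])  # [0, 0, 0, 0]
```

So under this model a sequential page-by-page scan with an 8k stripe bounces between spindles on every page, while a larger stripe keeps each run of pages on one disk, which is the behavior John is reasoning about.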
Hi John,

John A Meinel wrote:
> bm\mbn wrote:
>> Hi Everyone
>>
>> The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k
>> disks.
>>
>> 2 disks are in RAID1 and hold the OS, SWAP & pg_xlog
>> 4 disks are in RAID10 and hold the Cluster itself.
>>
>> the DB will have two major tables 1 with 10 million rows and one with
>> 100 million rows.
>> All the activities against this tables will be SELECT.
>
> What type of SELECTs will you be doing? Mostly sequential reads of a
> bunch of data, or indexed lookups of random pieces?

All of them: some R-tree, some B-tree, some without using indexes.

> From what I've heard of RAID, if you are doing large sequential
> transfers, larger stripe sizes (128k, 256k) generally perform better.
> For postgres, though, when you are writing, having the stripe size be
> around the same size as your page size (8k) could be advantageous, as
> when postgres reads a page, it only reads a single stripe. So if it were
> reading a series of pages, each one would come from a different disk.
>
> I may be wrong about that, though.

I must admit I'm a bit amazed that such an important parameter is so ambiguous. An optimal stripe size can improve the performance of the DB significantly. I bet that the difference in performance between a poor stripe setting and an optimal one matters more than how much RAM or CPU you have.

I hope to run some tests soon, though I have limited time on the production server to do such tests.

> John
> =:->

--
--------------------------
Canaan Surfing Ltd.
Internet Service Providers
Ben-Nes Michael - Manager
Tel: 972-4-6991122
Cel: 972-52-8555757
Fax: 972-4-6990098
http://www.canaan.net.il
--------------------------
On Tue, Sep 20, 2005 at 10:51:41AM +0300, Michael Ben-Nes wrote:
> I must admit im a bit amazed how such important parameter is so
> ambiguous. an optimal strip size can improve the performance of the db
> significantly.

It's configuration dependent. IME, it has an insignificant effect. If anything, changing it from the vendor default may make performance worse (maybe the firmware on the array is tuned for a particular size?).

> I bet that the difference in performance between a poor stripe setting
> to an optimal one is more important then how much RAM or CPU you have.

I'll take that bet, because I've benched it. If something so trivial (and completely free) were the single biggest factor in performance, do you really think it would be an undiscovered secret?

> I hope to run some tests soon thugh i have limited time on the
> production server to do such tests.

Well, benchmarking your data on your hardware is the first thing you should do, not something you should try to cram in late in the game. You can't get a valid answer to a "what's the best configuration" question until you've tested some configurations.

Mike Stone
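Mike's "bench it on your own hardware" advice can start with something as simple as timing sequential versus random page-sized reads against a scratch file. The sketch below is hypothetical (file size, names, and iteration counts are arbitrary choices of mine, not from the thread); note that a real test must use a file much larger than RAM, or the OS page cache will absorb everything and hide the disks entirely.

```python
import os
import random
import tempfile
import time

BLOCK = 8 * 1024               # PostgreSQL page size
FILE_SIZE = 16 * 1024 * 1024   # tiny for illustration; a real test should exceed RAM

# Create a scratch file to read back.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

def bench(offsets):
    """Time reading one BLOCK at each offset, in the given order."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

n_blocks = FILE_SIZE // BLOCK
sequential = [i * BLOCK for i in range(n_blocks)]
rand = sequential[:]
random.shuffle(rand)

print("sequential: %.4fs" % bench(sequential))
print("random:     %.4fs" % bench(rand))
os.unlink(path)
```

Run the same script after rebuilding the array with each candidate stripe size and compare the numbers; that is the kind of configuration-by-configuration testing Mike is talking about.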
Hi Everybody!

I've got a spare machine which is 2x XEON 3.2GHz, 4Gb RAM, 14x 140Gb SCSI 10k (LSI MegaRAID 320U). It is going into production in 3-5 months. I do have free time to run tests on this machine, and I could test different stripe sizes if somebody prepares a test script and data for that.

I could also test different RAID modes (0, 1, 5 and 10) for this script.

I guess the community needs these results.

On 16 Sep 2005 04:51:43 -0700
"bm\\mbn" <miki@canaan.co.il> wrote:

> Hi Everyone
>
> The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k
> disks.
>
> 2 disks are in RAID1 and hold the OS, SWAP & pg_xlog
> 4 disks are in RAID10 and hold the Cluster itself.
>
> the DB will have two major tables 1 with 10 million rows and one with
> 100 million rows.
> All the activities against this tables will be SELECT.
>
> Currently the strip size is 8k. I read in many place this is a poor
> setting.
>
> Am i right ?

--
Evgeny Gridasov
Software Developer
I-Free, Russia
Typically your stripe size impacts both read and write. In Solaris, the trick is to match it with your maxcontig parameter. If you set maxcontig to 128 pages, which is 128 * 8 = 1024k (1M), then your optimal stripe size is 128 * 8 / (number of spindles in the LUN). Assuming the number of spindles is 6, you get an odd number; in such cases either the current I/O or the next sequential I/O is going to be a little bit inefficient, depending on what you select (as a rule of thumb, however, just take the closest stripe size). If your number of spindles is 8, on the other hand, you get a perfect 128, and hence it makes sense to select 128K. (maxcontig is a parameter in Solaris which defines the maximum contiguous space allocated to a block, which really helps in the case of sequential I/O operations.)

But as you can see, this was maxcontig dependent in my case. What if your maxcontig is way off track? This can happen if your I/O pattern is more and more random. In such cases maxcontig is better at lower numbers to reduce space wastage, and in effect reducing your stripe size reduces your response time. This means it is now workload dependent: random I/Os versus sequential I/Os (at least where I/Os can be clubbed together).

So stripe size in Solaris is ultimately dependent on your workload, and my guess is that on any other platform, too, the stripe size depends on your workload and how it accesses the data. A lower stripe size helps smaller I/Os perform better but sacrifices total throughput efficiency, while a larger stripe size increases throughput efficiency at the cost of the response time of your small I/O requests.

Don't forget that many file systems will buffer your I/Os and can club them together if they look sequential from the file system's point of view. In such cases the effective I/O size is what matters for the stripe size: if your effective I/O sizes are big, go for a larger stripe size; if your effective I/O sizes are small and response time is critical, go for a smaller stripe size.

Regards,
Jignesh

evgeny gridasov wrote:
> Hi Everybody!
>
> I've got a spare machine which is 2xXEON 3.2GHz, 4Gb RAM
> 14x140Gb SCSI 10k (LSI MegaRaid 320U). It is going into production in 3-5 months.
> I do have free time to run tests on this machine, and I could test different stripe sizes
> if somebody prepares a test script and data for that.
>
> I could also test different RAID modes 0,1,5 and 10 for this script.
>
> I guess the community needs these results.
>
>> Hi Everyone
>>
>> The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k
>> disks.
>>
>> 2 disks are in RAID1 and hold the OS, SWAP & pg_xlog
>> 4 disks are in RAID10 and hold the Cluster itself.
>>
>> the DB will have two major tables 1 with 10 million rows and one with
>> 100 million rows.
>> All the activities against this tables will be SELECT.
>>
>> Currently the strip size is 8k. I read in many place this is a poor
>> setting.
>>
>> Am i right ?

--
______________________________
Jignesh K. Shah
MTS Software Engineer, MDE - Horizontal Technologies
Sun Microsystems, Inc
Phone: (781) 442 3052
Email: J.K.Shah@sun.com
______________________________
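Jignesh's arithmetic is easy to reproduce. A hypothetical helper (the function name is mine; the 8K page and maxcontig = 128 figures come from his example):

```python
PAGE_KB = 8  # 8K blocks, matching PostgreSQL's page size

def optimal_stripe_kb(maxcontig_pages, spindles):
    """Stripe size (in KB) so one maxcontig-sized I/O spans all spindles evenly."""
    return maxcontig_pages * PAGE_KB / spindles

# maxcontig = 128 pages -> a 1024K (1M) contiguous I/O
print(optimal_stripe_kb(128, 8))  # 128.0 -- a clean power of two, use 128K
print(optimal_stripe_kb(128, 6))  # ~170.7 -- awkward; round to the closest supported size
```

With 8 spindles the division comes out to exactly 128K; with 6 it does not, which is the "odd number" case where Jignesh suggests simply taking the closest stripe size the controller supports.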
I have benched different stripe sizes with different file systems, and the performance differences can be quite dramatic.
Theoretically a smaller stripe is better for OLTP, since more often than not you can write more small transactions independently to more different disks; but a large stripe size is good for data warehousing, as you are often doing very large sequential reads, and a larger stripe size will exploit the on-drive cache as you request larger single chunks from the disk at a time.
It also seems that different controllers are partial to different defaults that can affect their performance, so I would suggest that testing this on two different controller cards may be less than optimal.
I would also recommend looking at the file system. For us, JFS worked significantly faster than reiser for large read loads and large write loads, so we chose JFS over ext3 and reiser.
I found that overly low stripe sizes hurt performance badly, as did overly large ones.
Alex Turner
NetEconomist
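Alex's OLTP-versus-warehouse tradeoff comes down to how a single aligned I/O is split into stripe units. A hypothetical model (the aligned-I/O and round-robin assumptions are mine, and the 4-spindle figure is just an example):

```python
import math

KB = 1024
N_DISKS = 4  # example spindle count

def profile(io_bytes, stripe_bytes):
    """(stripe units issued, distinct disks hit) for one aligned I/O."""
    units = math.ceil(io_bytes / stripe_bytes)
    return units, min(N_DISKS, units)

for stripe_kb in (8, 64, 256):
    units, disks = profile(1024 * KB, stripe_kb * KB)
    print(f"1M sequential read, {stripe_kb:>3}K stripe: "
          f"{units:>3} units across {disks} disks ({1024 // units}K per unit)")
```

A 1M warehouse-style read spans all four spindles at every stripe size here, but with an 8K stripe it is chopped into 128 tiny per-disk transfers, while a 256K stripe issues just 4 large ones that can stream from each drive's cache, which is Alex's large-stripe argument. Conversely, an 8K OLTP write fits in one stripe unit at any of these sizes, so with small stripes many independent writes are more likely to land on different disks.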
On 16 Sep 2005 04:51:43 -0700, bmmbn <miki@canaan.co.il> wrote:

> Hi Everyone
>
> The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k
> disks.
>
> 2 disks are in RAID1 and hold the OS, SWAP & pg_xlog
> 4 disks are in RAID10 and hold the Cluster itself.
>
> the DB will have two major tables 1 with 10 million rows and one with
> 100 million rows.
> All the activities against this tables will be SELECT.
>
> Currently the strip size is 8k. I read in many place this is a poor
> setting.
>
> Am i right ?
Alex Turner wrote:
> I would also recommend looking at file system. For us JFS worked signifcantly
> faster than resier for large read loads and large write loads, so we chose JFS
> over ext3 and reiser.

Has JFS been reliable for you? There seems to be a lot of conjecture about instability, but I find JFS a potentially attractive alternative for a number of reasons.

Richard
I have found JFS to be just fine. We have been running a medium load on this server for 9 months with no unscheduled downtime. The database is about 30gig on disk, and we get about 3-4 requests per second that generate result sets in the thousands, from about 8am to about 11pm.
I have found that JFS barfs if you put a million files in a directory and try to do an 'ls', but then so did reiser; only ext3 handled this test successfully. Fortunately, with a database this is an atypical situation, so JFS has been fine for the DB for us so far.
We have had severe problems with ext3 when file systems hit 100% usage; they get all kinds of unhappy. We haven't had the same problem with JFS.
Alex Turner
NetEconomist
On 9/20/05, Welty, Richard <richard.welty@bankofamerica.com> wrote:
> Alex Turner wrote:
> > I would also recommend looking at file system. For us JFS worked signifcantly
> > faster than resier for large read loads and large write loads, so we chose JFS
> > over ext3 and reiser.
>
> has jfs been reliable for you? there seems to be a lot of conjecture about instability,
> but i find jfs a potentially attractive alternative for a number of reasons.
>
> richard
We have a production server (8.0.2) running 24x7, 300k+ transactions per day, on Linux 2.6.11 with a JFS file system. No problems. It works faster than ext3.

> Alex Turner wrote:
>
> > I would also recommend looking at file system. For us JFS worked signifcantly
> > faster than resier for large read loads and large write loads, so we chose JFS
> > over ext3 and reiser.
>
> has jfs been reliable for you? there seems to be a lot of conjecture about instability,
> but i find jfs a potentially attractive alternative for a number of reasons.
>
> richard

--
Evgeny Gridasov
Software Developer
I-Free, Russia
Hi Everybody.

I am going to replace some 'select count(*) from ... where ...' queries which run on large tables (10M+ rows) with something like 'explain select * from ... where ...', and parse the planner output after that to find out its forecast of the number of rows the query is going to retrieve.

Since my users do not need an exact row count for large tables, this will boost performance for my application. I ran some queries with EXPLAIN and then EXPLAIN ANALYZE; if I set the statistics target for the table to about 200-300, the planner's forecast seems to work very well.

My questions are:
1. Is there a way to interact with the postgresql planner other than 'explain ...'? An aggregate query like 'select estimate_count(*) from ...' would really help =))
2. How precise is the planner's row count forecast for a complex query (a select with 3-5 joined tables, aggregates, subselects, etc.)?

--
Evgeny Gridasov
Software Developer
I-Free, Russia
evgeny gridasov wrote:
> Hi Everybody.
>
> I am going to replace some 'select count(*) from ... where ...' queries
> which run on large tables (10M+ rows) with something like
> 'explain select * from ... where ....' and parse planner output after that
> to find out its forecast about number of rows the query is going to retrieve.
>
> Since my users do not need exact row count for large tables, this will
> boost performance for my application. I ran some queries with explain and
> explain analyze then. If i set statistics number for the table about 200-300
> the planner forecast seems to be working very fine.
>
> My questions are:
> 1. Is there a way to interact with postgresql planner, other than 'explain ...'? An aggregate query like 'select estimate_count(*) from ...' would really help =))
> 2. How precise is the planner row count forecast given for a complex query (select with 3-5 joint tables, aggregates, subselects, etc...)?

I think that this has been done before. Check the list archives (I believe it may have been Michael Fuhr?).

Ah, check this:
http://archives.postgresql.org/pgsql-sql/2005-08/msg00046.php
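The "parse the planner output" step Evgeny describes can be done with a small regular expression over the first line of EXPLAIN output, which carries the top-level rows= estimate. A minimal sketch (the sample plan line is made up for illustration; in practice you would fetch the EXPLAIN output through your database driver):

```python
import re

def estimated_rows(explain_top_line):
    """Pull the planner's row estimate out of an EXPLAIN plan line,
    e.g. 'Seq Scan on big  (cost=0.00..431.00 rows=20000 width=4)'."""
    m = re.search(r"rows=(\d+)", explain_top_line)
    if m is None:
        raise ValueError("no rows= estimate found in plan line")
    return int(m.group(1))

# Made-up sample of what EXPLAIN's first output line looks like:
plan = "Seq Scan on big  (cost=0.00..431.00 rows=20000 width=4)"
print(estimated_rows(plan))  # 20000
```

The estimate is only as good as the table's statistics, which is why Evgeny's point about raising the statistics target matters: with a higher target, ANALYZE samples more of the table and the rows= figure tracks reality more closely.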