Thread: setting up raid10 with more than 4 drives
Hi, this is not really PostgreSQL specific, but any help is appreciated. I have read that the more spindles, the better the IO performance.

Suppose I have 8 drives: should a stripe (RAID0) be created over 2 mirrors (RAID1) of 4 drives each, OR should the stripe be created over 4 mirrors of 2 drives each?

Also, do single-channel vs. dual-channel controllers make a lot of difference in RAID10 performance?

Regds
mallah.
Stripe of mirrors is preferred to mirror of stripes for the best balance of protection and performance.

In the stripe of mirrors you can lose up to half of the disks and still be operational. In the mirror of stripes, the most you could lose is two drives. The performance of the two should be similar - perhaps the seek performance would be different for high concurrent use in PG.

- Luke

On 5/29/07 2:14 PM, "Rajesh Kumar Mallah" <mallah.rajesh@gmail.com> wrote:
> Suppose I have 8 drives: should a stripe (RAID0) be created over 2 mirrors
> (RAID1) of 4 drives each, OR should the stripe be created over 4 mirrors
> of 2 drives each?
On 5/30/07, Luke Lonergan <llonergan@greenplum.com> wrote:
> Stripe of mirrors is preferred to mirror of stripes for the best balance of
> protection and performance.

Nooo! I am not asking about RAID10 vs RAID01. I am considering stripe of mirrors only. The question is how a larger number of disks is BEST utilized in terms of IO performance:

1. by adding more mirrored pairs to the stripe, OR
2. by adding more hard drives to the existing mirrors.

Say I had 4 drives in RAID10 format:

D1 raid1 D2  --> MD0
D3 raid1 D4  --> MD1
MD0 raid0 MD1 --> MDF (final)

Now I get 2 more drives, D5 and D6, so I have 2 options:

1. create a new mirror

D5 raid1 D6  --> MD2
MD0 raid0 MD1 raid0 MD2 --> MDF (final)

OR

2. widen the existing mirrors

D1 raid1 D2 raid1 D5 --> MD0
D3 raid1 D4 raid1 D6 --> MD1
MD0 raid0 MD1 --> MDF (final)

Thanks, hope my question is clear now.

Regds
mallah.
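In Linux mdadm terms, the two options look roughly like the sketch below. Device names are hypothetical; option 1 requires re-creating the top-level stripe because an md RAID0 cannot be reshaped in place (which destroys its contents), and option 2 assumes an mdadm/kernel recent enough to grow a RAID1's device count - the discussion below confirms md happily accepts more than two RAID1 members.

# Option 1: add a third mirrored pair, then rebuild the stripe across 3 mirrors
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
mdadm --create /dev/md3 --level=0 --chunk=64 --raid-devices=3 \
      /dev/md0 /dev/md1 /dev/md2        # replaces the old 2-mirror stripe

# Option 2: grow each existing 2-way mirror to a 3-way mirror
mdadm /dev/md0 --add /dev/sde1
mdadm --grow /dev/md0 --raid-devices=3
mdadm /dev/md1 --add /dev/sdf1
mdadm --grow /dev/md1 --raid-devices=3   # the raid0 on top stays untouched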
Hi Rajesh,

On 5/29/07 7:18 PM, "Rajesh Kumar Mallah" <mallah.rajesh@gmail.com> wrote:
> D1 raid1 D2 raid1 D5 --> MD0
> D3 raid1 D4 raid1 D6 --> MD1
> MD0 raid0 MD1 --> MDF (final)

AFAIK you can't RAID1 more than two drives, so the above doesn't make sense to me.

- Luke
* Luke Lonergan (llonergan@greenplum.com) wrote:
> On 5/29/07 7:18 PM, "Rajesh Kumar Mallah" <mallah.rajesh@gmail.com> wrote:
> > D1 raid1 D2 raid1 D5 --> MD0
> > D3 raid1 D4 raid1 D6 --> MD1
> > MD0 raid0 MD1 --> MDF (final)
>
> AFAIK you can't RAID1 more than two drives, so the above doesn't make sense
> to me.

It's just more copies of the same data if it's really a RAID1, for the extra, extra paranoid. Basically, in the example above, I'd read it as "D1, D2, D5 have identical data on them".

Thanks,

Stephen
Stephen,

On 5/29/07 8:31 PM, "Stephen Frost" <sfrost@snowman.net> wrote:
> It's just more copies of the same data if it's really a RAID1, for the
> extra, extra paranoid. Basically, in the example above, I'd read it as
> "D1, D2, D5 have identical data on them".

In that case, I'd say it's a waste of disk to add 1+2 redundancy to the mirrors.

- Luke
On 5/29/07, Luke Lonergan <llonergan@greenplum.com> wrote:
> AFAIK you can't RAID1 more than two drives, so the above doesn't make sense
> to me.

Yeah, I've never seen a way to RAID-1 more than 2 drives either. It would have to be his first one:

D1 + D2 = MD0 (RAID 1)
D3 + D4 = MD1 ...
D5 + D6 = MD2 ...
MD0 + MD1 + MD2 = MDF (RAID 0)

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/
On Wed, 30 May 2007, Jonah H. Harris wrote:
> On 5/29/07, Luke Lonergan <llonergan@greenplum.com> wrote:
>> AFAIK you can't RAID1 more than two drives, so the above doesn't make
>> sense
>> to me.
>
> Yeah, I've never seen a way to RAID-1 more than 2 drives either. It
> would have to be his first one:
>
> D1 + D2 = MD0 (RAID 1)
> D3 + D4 = MD1 ...
> D5 + D6 = MD2 ...
> MD0 + MD1 + MD2 = MDF (RAID 0)
>
I don't know what the failure mode ends up being, but on Linux I had no
problems creating what appears to be a massively redundant (but small) array:
md0 : active raid1 sdo1[10](S) sdn1[8] sdm1[7] sdl1[6] sdk1[5] sdj1[4] sdi1[3] sdh1[2] sdg1[9] sdf1[1] sde1[11](S) sdd1[0]
896 blocks [10/10] [UUUUUUUUUU]
David Lang
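For reference, an array like that can be created in one shot; a sketch with the same (hypothetical) partition names - 10 active copies plus 2 hot spares, matching the layout above:

mdadm --create /dev/md0 --level=1 --raid-devices=10 --spare-devices=2 \
      /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 \
      /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1
cat /proc/mdstat    # should show [10/10] [UUUUUUUUUU] once the initial resync finishes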
Good point. Also, if you had RAID1 with 3 drives and some bit errors, at least you can take a vote on what's right, whereas if you only have 2 and they disagree, how do you know which is right other than picking one and hoping? But whatever you do, it will be slower to keep in sync on a heavy-write system.
Peter.
* Peter Childs (peterachilds@gmail.com) wrote:
> Good point. Also, if you had RAID1 with 3 drives and some bit errors, at
> least you can take a vote on what's right, whereas if you only have 2 and
> they disagree, how do you know which is right other than picking one and
> hoping?

I'm not sure, but I don't think most RAID1 systems do reads against all drives and compare the results before returning it to the caller... I'd be curious if I'm wrong.

Thanks,

Stephen
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On 5/29/07, Luke Lonergan <llonergan@greenplum.com> wrote: >> AFAIK you can't RAID1 more than two drives, so the above doesn't make sense >> to me. Sure you can. In fact it's a very common backup strategy. You build a three-way mirror and then when it comes time to back it up you break it into a two-way mirror and back up the orphaned array at your leisure. When it's done you re-add it and rebuild the degraded array. Good raid controllers can rebuild the array at low priority squeezing in the reads in idle cycles. I don't think you normally do it for performance though since there's more to be gained by using larger stripes. In theory you should get the same boost on reads as widening your stripes but of course you get no benefit on writes. And I'm not sure raid controllers optimize raid1 accesses well in practice either. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Hi Peter,

On 5/30/07 12:29 AM, "Peter Childs" <peterachilds@gmail.com> wrote:
> Good point. Also, if you had RAID1 with 3 drives and some bit errors, at
> least you can take a vote on what's right, whereas if you only have 2 and
> they disagree, how do you know which is right other than picking one and
> hoping? But whatever you do, it will be slower to keep in sync on a
> heavy-write system.

Much better to get a RAID system that checksums blocks so that "good" is known. Solaris ZFS does that, as do high end systems from EMC and HDS.

- Luke
On Wed, May 30, 2007 at 07:06:54AM -0700, Luke Lonergan wrote:
> Much better to get a RAID system that checksums blocks so that "good" is
> known. Solaris ZFS does that, as do high end systems from EMC and HDS.

I don't see how that's better at all; in fact, it reduces to exactly the same problem: given two pieces of data which disagree, which is right? The ZFS hashes do a better job of error detection, but that's still not the same thing as a voting system (3 copies, 2 of 3 is the correct answer) to resolve inconsistencies.

Mike Stone
> I don't see how that's better at all; in fact, it reduces to
> exactly the same problem: given two pieces of data which
> disagree, which is right?

The one that matches the checksum.

- Luke
On Wed, May 30, 2007 at 10:36:48AM -0400, Luke Lonergan wrote:
>> I don't see how that's better at all; in fact, it reduces to
>> exactly the same problem: given two pieces of data which
>> disagree, which is right?
>
> The one that matches the checksum.

And you know the checksum is good, how?

Mike Stone
It's created when the data is written to both drives.
This is standard stuff, very well proven: try googling 'self healing zfs'.
- Luke
Msg is shrt cuz m on ma treo
"Michael Stone" <mstone+postgres@mathom.us> writes: "Michael Stone" <mstone+postgres@mathom.us> writes: > On Wed, May 30, 2007 at 07:06:54AM -0700, Luke Lonergan wrote: > > > Much better to get a RAID system that checksums blocks so that "good" is > > known. Solaris ZFS does that, as do high end systems from EMC and HDS. > > I don't see how that's better at all; in fact, it reduces to exactly the same > problem: given two pieces of data which disagree, which is right? Well, the one where the checksum is correct. In practice I've never seen a RAID failure due to outright bad data. In my experience when a drive goes bad it goes really bad and you can't read the block at all without i/o errors. In every case where I've seen bad data it was due to bad memory (in one case bad memory in the RAID controller cache -- that was hell to track down). Checksums aren't even enough in that case as you'll happily generate a checksum for the bad data before storing it... -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Wed, 30 May 2007 16:36:48 +0200, Luke Lonergan <LLonergan@greenplum.com> wrote:
>> I don't see how that's better at all; in fact, it reduces to
>> exactly the same problem: given two pieces of data which
>> disagree, which is right?
>
> The one that matches the checksum.

- postgres tells the OS "write this block"
- the OS sends the block to drives A and B
- drive A happens to be lucky, seeks faster, writes the data
- a student intern carrying pizzas for senior IT staff trips over the power cord*
- boom
- drive B still has the old block

Both blocks have a correct checksum, so only a version counter/timestamp could tell them apart. Fortunately, if fsync() is honored correctly (did you check?), postgres will zap such errors in recovery.

Smart RAID1 or 0+1 controllers (including software RAID) will distribute random reads to both disks (but not writes, obviously).

* = this happened at my old job; yes, they had a very frightening server room, or more precisely "cave". I never went there, I didn't want to be the one fired for tripping over the wire...

From the Linux Software RAID HOWTO:

- benchmarking (quite brief!)
  http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO-9.html#ss9.5
- read "Data Scrubbing" here:
  http://gentoo-wiki.com/HOWTO_Install_on_Software_RAID
- yeah, but does it work? (scary)
  http://bugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=405919

md/sync_action
  This can be used to monitor and control the resync/recovery process of MD.
  In particular, writing "check" here will cause the array to read all data
  blocks and check that they are consistent (e.g. parity is correct, or all
  mirror replicas are the same). Any discrepancies found are NOT corrected.
  A count of problems found will be stored in md/mismatch_count. Alternately,
  "repair" can be written, which will cause the same check to be performed,
  but any errors will be corrected.
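In practice, the check/repair interface quoted above is driven through sysfs, roughly as follows; paths assume the array is /dev/md0, and the mismatch counter is exposed as mismatch_cnt on most kernels even though the doc text calls it mismatch_count:

echo check  > /sys/block/md0/md/sync_action    # read everything, verify the copies, don't fix
cat /sys/block/md0/md/mismatch_cnt             # number of inconsistencies found
echo repair > /sys/block/md0/md/sync_action    # re-run the scan and correct discrepancies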
Oh, by the way, I saw a nifty patch in the queue:

  Find a way to reduce rotational delay when repeatedly writing last WAL page
  Currently fsync of WAL requires the disk platter to perform a full rotation
  to fsync again. One idea is to write the WAL to different offsets that might
  reduce the rotational delay.

This will not work if the WAL is on RAID1, because two disks never spin at exactly the same speed...
> This is standard stuff, very well proven: try googling 'self healing zfs'.
The first hit on this search is a demo of ZFS detecting corruption of one half of a mirror pair using checksums, very cool:
http://www.opensolaris.org/os/community/zfs/demos/selfheal/;jsessionid=52508D464883F194061E341F58F4E7E1
The bad drive is pointed out directly using the checksum and the data integrity is preserved.
- Luke
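The demo boils down to commands along these lines (a sketch with placeholder Solaris device names):

zpool create tank mirror c1t0d0 c1t1d0    # checksummed, mirrored pool
zpool scrub tank                          # read every block and verify checksums
zpool status -v tank                      # reports which device returned bad data and what was repaired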
On Wed, May 30, 2007 at 08:51:45AM -0700, Luke Lonergan wrote:
> > This is standard stuff, very well proven: try googling 'self healing zfs'.
> The first hit on this search is a demo of ZFS detecting corruption of one
> half of a mirror pair using checksums, very cool:
>
> http://www.opensolaris.org/os/community/zfs/demos/selfheal/;jsessionid=52508D464883F194061E341F58F4E7E1
>
> The bad drive is pointed out directly using the checksum and the data
> integrity is preserved.

One part is corruption. Another is ordering and consistency. ZFS represents both RAID-style storage *and* a journal-style file system. I imagine consistency and ordering are handled through journalling.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder | Ottawa, Ontario, Canada
Sorry for posting and disappearing.

I am still not clear on the best way of throwing more disks into the system. Does more stripes mean more performance (mostly)? Also, is there any rule of thumb about the best stripe size? (8k, 16k, 32k...)

regds
mallah

On 5/30/07, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
> One part is corruption. Another is ordering and consistency. ZFS represents
> both RAID-style storage *and* a journal-style file system. I imagine
> consistency and ordering are handled through journalling.
Mark,

On 5/30/07 8:57 AM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:
> One part is corruption. Another is ordering and consistency. ZFS represents
> both RAID-style storage *and* a journal-style file system. I imagine
> consistency and ordering are handled through journalling.

Yep, and versioning, which answers PFC's scenario.

Short answer: ZFS has a very reliable model that uses checksumming and journaling along with block versioning to implement "self healing". There are others that do some similar things with checksumming on the SAN HW and cooperation with the filesystem.

- Luke
On Thu, May 31, 2007 at 01:28:58AM +0530, Rajesh Kumar Mallah wrote:
> I am still not clear on the best way of throwing more disks into the
> system. Does more stripes mean more performance (mostly)? Also, is there
> any rule of thumb about the best stripe size? (8k, 16k, 32k...)

It isn't that simple. RAID1 should theoretically give you the best read performance. If all you care about is reads, then "best performance" would be to add more mirrors to your array.

For write performance, RAID0 is the best. I think this is what you mean by "more stripes".

This is where RAID 1+0/0+1 comes in. To reconcile the above, your question seems to be: "I have a RAID 1+0/0+1 system. Should I add disks onto the 0 part of the array? Or the 1 part of the array?"

My conclusion would be: both, unless you are certain that your load is scaled heavily towards reads, in which case the 1, or if scaled heavily towards writes, then the 0.

Then come the other factors. Do you want redundancy? Then you want 1. Do you want capacity? Then you want 0. There is no single answer for most people.

For me, stripe size is the last decision to make, and it may be heavily sensitive to load patterns. This suggests a trial and error / benchmarking requirement to determine the optimal stripe size for your use.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder | Ottawa, Ontario, Canada
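One crude way to run that trial and error, sketched with hypothetical scratch arrays (the --create wipes /dev/md0 and /dev/md1, so only do this on a test rig) and a sequential dd workload; a pgbench run against the same filesystem would be a more representative test for PostgreSQL:

for chunk in 8 16 32 64 128; do
    mdadm --create /dev/md9 --run --level=0 --chunk=$chunk --raid-devices=2 \
          /dev/md0 /dev/md1
    mkfs.ext3 -q /dev/md9
    mount /dev/md9 /mnt/test
    dd if=/dev/zero of=/mnt/test/big bs=8k count=500000 conv=fsync    # ~4 GB write test
    umount /mnt/test && mount /dev/md9 /mnt/test                      # drop the page cache
    dd if=/mnt/test/big of=/dev/null bs=8k                            # sequential read test
    umount /mnt/test
    mdadm --stop /dev/md9
done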
On 5/31/07, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
> This is where RAID 1+0/0+1 comes in. To reconcile the above, your question
> seems to be: "I have a RAID 1+0/0+1 system. Should I add disks onto the 0
> part of the array? Or the 1 part of the array?"
>
> My conclusion would be: both, unless you are certain that your load is
> scaled heavily towards reads, in which case the 1, or if scaled heavily
> towards writes, then the 0.

Thanks, this answers my query. All the time I was thinking of the 1+0 only, failing to observe the 0+1 part in it.

> Then come the other factors. Do you want redundancy? Then you want 1.
> Do you want capacity? Then you want 0.

Ok.

> For me, stripe size is the last decision to make, and it may be heavily
> sensitive to load patterns. This suggests a trial and error / benchmarking
> requirement to determine the optimal stripe size for your use.

Thanks.

mallah.
On Wed, May 30, 2007 at 12:41:46AM -0400, Jonah H. Harris wrote:
> Yeah, I've never seen a way to RAID-1 more than 2 drives either.

pannekake:~> grep -A 1 md0 /proc/mdstat
md0 : active raid1 dm-20[2] dm-19[1] dm-18[0]
      64128 blocks [3/3] [UUU]

It's not a big device, but I can assure you it exists :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
Hi,

On 1 Jun 2007, at 1:39, Steinar H. Gunderson wrote:
> On Wed, May 30, 2007 at 12:41:46AM -0400, Jonah H. Harris wrote:
>> Yeah, I've never seen a way to RAID-1 more than 2 drives either.
>
> pannekake:~> grep -A 1 md0 /proc/mdstat
> md0 : active raid1 dm-20[2] dm-19[1] dm-18[0]
>       64128 blocks [3/3] [UUU]
>
> It's not a big device, but I can assure you it exists :-)

I talked to someone yesterday who did a 10- or 11-way RAID1 with Linux MD for high-performance video streaming. Seemed to work very well.

- Sander
Apologies for a somewhat off-topic question, but...

The Linux kernel doesn't properly detect my software RAID1+0 when I boot up. It detects the two RAID1 arrays, the partitions of which are marked properly. But it can't find the RAID0 on top of that, because there's no corresponding device to auto-detect. The result is that it creates /dev/md0 and /dev/md1 and assembles the RAID1 devices on bootup, but /dev/md2 isn't created, so the RAID0 can't be assembled at boot time.

Here's what it looks like:

$ cat /proc/mdstat
Personalities : [raid0] [raid1]
md2 : active raid0 md0[0] md1[1]
      234436224 blocks 64k chunks

md1 : active raid1 sde1[1] sdc1[2]
      117218176 blocks [2/2] [UU]

md0 : active raid1 sdd1[1] sdb1[0]
      117218176 blocks [2/2] [UU]

$ uname -r
2.6.12-1.1381_FC3

After a reboot, I always have to do this:

mknod /dev/md2 b 9 2
mdadm --assemble /dev/md2 /dev/md0 /dev/md1
mount /dev/md2

What am I missing here?

Thanks,
Craig
Craig,

to make things work properly here you need to create a config file keeping both RAID1 and RAID0 information (/etc/mdadm/mdadm.conf). However, if your root filesystem is corrupted, or you lose this file, or move the disks somewhere else - you are back to the same initial issue :))

So, the solution I've found 100% working in any case is: use mdadm to create the raid1 devices (as you do already) and then use LVM to create the raid0 volume on top - LVM writes its own labels on every MD device and will find its volume pieces automatically! Tested for crash several times and was surprised by its robustness :))

Rgds,
-Dimitri

On 6/1/07, Craig James <craig_james@emolecules.com> wrote:
> The Linux kernel doesn't properly detect my software RAID1+0 when I boot up.
> It detects the two RAID1 arrays, the partitions of which are marked
> properly. But it can't find the RAID0 on top of that, because there's no
> corresponding device to auto-detect.
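A sketch of that layout with LVM2, assuming the two mirrors are /dev/md0 and /dev/md1; the volume group name, logical volume name, and size are hypothetical:

pvcreate /dev/md0 /dev/md1                  # label both mirrors as LVM physical volumes
vgcreate pgvg /dev/md0 /dev/md1
lvcreate --stripes 2 --stripesize 64 --size 200G --name pgdata pgvg
mkfs.ext3 /dev/pgvg/pgdata                  # LVM re-finds its pieces by label at boot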
Dimitri,

LVM is great, one thing to watch out for: it is very slow compared to pure md. That will only matter in practice if you want to exceed 1GB/s of sequential I/O bandwidth.

- Luke

On 6/1/07 11:51 AM, "Dimitri" <dimitrik.fr@gmail.com> wrote:
> So, the solution I've found 100% working in any case is: use mdadm to
> create the raid1 devices (as you do already) and then use LVM to create
> the raid0 volume on top - LVM writes its own labels on every MD device
> and will find its volume pieces automatically!
On Fri, Jun 01, 2007 at 10:57:56AM -0700, Craig James wrote:
> The Linux kernel doesn't properly detect my software RAID1+0 when I boot
> up. It detects the two RAID1 arrays, the partitions of which are marked
> properly. But it can't find the RAID0 on top of that, because there's no
> corresponding device to auto-detect. The result is that it creates
> /dev/md0 and /dev/md1 and assembles the RAID1 devices on bootup, but
> /dev/md2 isn't created, so the RAID0 can't be assembled at boot time.

Either do your md discovery in userspace via mdadm (your distribution can probably help you with this), or simply use the raid10 module instead of building raid1+0 yourself.

/* Steinar */
--
Homepage: http://www.sesse.net/
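The single-layer route Steinar mentions would look something like the sketch below (hypothetical partitions; n2 is the default "near" layout with two copies, i.e. classic RAID10), with the result recorded in mdadm.conf so boot-time assembly is deterministic:

mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=64 --raid-devices=4 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm --detail --scan >> /etc/mdadm.conf    # or /etc/mdadm/mdadm.conf on Debian-style systems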
> On Fri, Jun 01, 2007 at 10:57:56AM -0700, Craig James wrote:
> > The Linux kernel doesn't properly detect my software RAID1+0 when I boot
> > up. It detects the two RAID1 arrays, the partitions of which are marked
> > properly. But it can't find the RAID0 on top of that, because there's no
> > corresponding device to auto-detect.

Hi Craig:

I had the same problem for a short time. There *is* a device to base the RAID0 off, however, it needs to be recursively detected. mdadm will do this for you; however, if the device order isn't optimal, it may need some help via /etc/mdadm.conf. For a while, I used something like:

DEVICE partitions
...
ARRAY /dev/md3 level=raid0 num-devices=2 UUID=10d58416:5cd52161:7703b48e:cd93a0e0
ARRAY /dev/md5 level=raid1 num-devices=2 UUID=1515ac26:033ebf60:fa5930c5:1e1f0f12
ARRAY /dev/md6 level=raid1 num-devices=2 UUID=72ddd3b6:b063445c:d7718865:bb79aad7

My symptoms were that it worked when started from user space, but failed during reboot without the above hints. I believe if I had defined md5 and md6 before md3, it may have worked automatically without hints.

On Fri, Jun 01, 2007 at 11:35:01PM +0200, Steinar H. Gunderson wrote:
> Either do your md discovery in userspace via mdadm (your distribution can
> probably help you with this), or simply use the raid10 module instead of
> building raid1+0 yourself.

I agree with using the mdadm RAID10 support. RAID1+0 has the flexibility of allowing you to fine-control the RAID1 vs RAID0 if you want to add disks later. RAID10 from mdadm has the flexibility that you don't need an even number of disks. As I don't intend to add disks to my array, the RAID10 as a single layer, with potentially better intelligence in terms of performance, appeals to me. They both worked for me - but I am sticking with the single layer now.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder | Ottawa, Ontario, Canada
Steinar,

On 6/1/07 2:35 PM, "Steinar H. Gunderson" <sgunderson@bigfoot.com> wrote:
> Either do your md discovery in userspace via mdadm (your distribution can
> probably help you with this), or simply use the raid10 module instead of
> building raid1+0 yourself.

I found md raid10 to be *very* slow compared to raid1+0 on Linux 2.6.9 -> 2.6.18. Very slow in this case is < 400 MB/s compared to 1,800 MB/s.

- Luke