Re: Hardware vs Software RAID - Mailing list pgsql-performance
From: Peter T. Breuer
Subject: Re: Hardware vs Software RAID
Date:
Msg-id: 200806261349.m5QDniN6026724@betty.it.uc3m.es
In response to: Re: Hardware vs Software RAID ("Merlin Moncure" <mmoncure@gmail.com>)
Responses: Re: Hardware vs Software RAID
List: pgsql-performance
"Also sprach Merlin Moncure:" > As discussed down thread, software raid still gets benefits of > write-back caching on the raid controller...but there are a couple of (I wish I knew what write-back caching was!) Well, if you mean the Linux software raid driver, no, there's no extra caching (buffering). Every request arriving at the device is duplicated (for RAID1), using a local finite cache of buffer head structures and real extra muffers from the kernel's general resources. Every arriving request is dispatched two its subtargets as it arrives (as two or more new requests). On reception of both (or more) acks, the original request is acked, and not before. This imposes a considerable extra resource burden. It's a mystery to me why the driver doesn't deadlock against other resource eaters that it may depend on. Writing to a device that also needs extra memory per request in its driver should deadlock it, in theory. Against a network device as component, it's a problem (tcp needs buffers). However the lack of extra buffering is really deliberate (double buffering is a horrible thing in many ways, not least because of the probable memory deadlock against some component driver's requirement). The driver goes to the lengths of replacing the kernel's generic make_request function just for itself in order to make sure full control resides in the driver. This is required, among other things, to make sure that request order is preserved, and that requests. It has the negative that standard kernel contiguous request merging does not take place. But that's really required for sane coding in the driver. Getting request pages into general kernel buffers ... may happen. > things I'd like to add. First, if your sever is extremely busy, the > write back cache will eventually get overrun and performance will > eventually degrade to more typical ('write through') performance. I'd like to know where this 'write back cache' �s! (not to mention what it is :). What on earth does `write back' mean? Peraps you mean the kernel's general memory system, which has the effect of buffering and caching requests on the way to drivers like raid. Yes, if you write to a device, any device, you will only write to the kernel somwhere, which may or may not decide now or later to send the dirty buffers thus created on to the driver in question, either one by one or merged. But as I said, raid replaces most of the kernel's mechanisms in that area (make_request, plug) to avoid losing ordering. I would be surprised if the raw device exhibited any buffering at all after getting rid of the generic kernel mechanisms. Any buffering you see would likely be happening at file system level (and be a darn nuisance). Reads from the device are likely to hit the kernel's existing buffers first, thus making them act as a "cache". > Secondly, many hardware raid controllers have really nasty behavior in > this scenario. Linux software raid has decent degradation in overload I wouldn't have said so! If there is any, it's sort of accidental. On memory starvation, the driver simply couldn't create and despatch component requests. Dunno what happens then. It won't run out of buffer head structs though, since it's pretty well serialised on those, per device, in order to maintain request order, and it has its own cache. > conditions but many popular raid controllers (dell perc/lsi logic sas > for example) become unpredictable and very bursty in sustained high > load conditions. 
> conditions but many popular raid controllers (dell perc/lsi logic sas
> for example) become unpredictable and very bursty in sustained high
> load conditions.

Well, that's because they can't tell the linux memory manager to quit storing data from them in memory and let them have it NOW (a general problem .. how one gets feedback on the mm state, I don't know). Maybe one could .. one can control buffer aging pretty much per device nowadays. Perhaps one can set the limit to zero for buffer age in memory before being sent to the device. That would help. Also one can lower the bdflush limit at which the device goes sync. All that would help against bursty performance, but it would slow ordinary operation towards sync behaviour.

> As greg mentioned, I trust the linux kernel software raid much more
> than the black box hw controllers. Also, contrary to vast popular

Well, it's readable code. That's the basis for my comments!

> mythology, the 'overhead' of sw raid in most cases is zero except in
> very particular conditions.

It's certainly very small. It would be smaller still if we could avoid needing new buffers per device. Perhaps the dm multipathing allows that.

Peter
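P.S. To make the buffer-aging idea above a little more concrete: on 2.4 kernels those knobs lived in /proc/sys/vm/bdflush; on 2.6 kernels the rough equivalents are the vm.dirty_* sysctls. A minimal sketch of nudging them towards sync-ish behaviour (the files named are the 2.6 ones, the values are only illustrative, not a recommendation):

/* Sketch: shorten how long dirty buffers may age in memory and lower
 * the threshold at which background writeback kicks in, so data is
 * pushed to the device sooner.  Assumes a 2.6-era kernel exposing the
 * vm.dirty_* sysctls under /proc; values are illustrative only. */
#include <stdio.h>

static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    /* Expire dirty buffers after 1 second (the default is typically 30s). */
    write_sysctl("/proc/sys/vm/dirty_expire_centisecs", "100");
    /* Start background writeback once 1% of memory is dirty. */
    write_sysctl("/proc/sys/vm/dirty_background_ratio", "1");
    return 0;
}

The tradeoff is exactly the one mentioned above: ordinary operation drifts towards synchronous behaviour, so burstiness is traded for lower average throughput.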