Thread: best use of an EMC SAN
Assuming we have 24 73 GB drives, is it better to make one big metalun, carve it up, and let the SAN manage where everything is, or is it better to specify which spindles go where?

Currently we would require 3 separate disk arrays: one for the main database, a second for the WAL logs, and a third for the most active table.

The problem with dedicating the spindles to each array is that we end up wasting space. Are SANs smart enough to do a better job if I create one large metalun and cut it up?

Dave
We do something similar here. We use NetApp, and I carve one aggregate per data volume. I generally keep pg_xlog on the same "data" LUN, but I don't mix other databases on the same aggregate. In the NetApp world, because they use RAID-DP (dual parity), you waste more drives; however, you are guaranteed that an erroneous query won't clobber the I/O of another database.

In my experience, NetApp has utilities that set "IO priority", but they aren't granular enough -- it's more like using "renice" in Unix, and it doesn't really make that big of a difference.

My recommendation: each database gets its own aggregate unless the I/O footprint is very low. Let me know if you need more details.

Regards,
Dan Gorman

On Jul 11, 2007, at 6:03 AM, Dave Cramer wrote:
> Assuming we have 24 73G drives is it better to make one big metalun
> and carve it up and let the SAN manage where everything is, or
> is it better to specify which spindles are where.
"Dave Cramer" <pg@fastcrypt.com> writes:
> Assuming we have 24 73G drives is it better to make one big metalun and carve
> it up and let the SAN manage where everything is, or is it better to
> specify which spindles are where.

This is quite a controversial question, with proponents of both strategies.

I would suggest having one RAID-1 array for the WAL and throwing the rest of the drives at a single big array for the data files. That wastes space, since the WAL isn't big, but the benefit is big.

If you have a battery-backed cache you might not need even that; just throwing them all into one big RAID might work just as well.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
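[Editorial note: once the arrays are carved out as suggested above, the usual 8.2-era recipe for putting the WAL on its own volume was to move pg_xlog and leave a symlink behind (initdb did not yet have a separate-xlog option). A minimal sketch, assuming hypothetical mount points /data for the big array and /wal for the RAID-1 pair:]

```shell
# Stop the cluster before touching pg_xlog (paths are placeholders)
pg_ctl -D /data/pgdata stop

# Move the WAL onto the dedicated RAID-1 volume and leave a symlink behind
mv /data/pgdata/pg_xlog /wal/pg_xlog
ln -s /wal/pg_xlog /data/pgdata/pg_xlog

pg_ctl -D /data/pgdata start
```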
On 11-Jul-07, at 10:05 AM, Gregory Stark wrote:
> I would suggest having one RAID-1 array for the WAL and throw the
> rest of the drives at a single big array for the data files.

This is quite unexpected. Since the WAL is primarily all writes, isn't a RAID 1 the slowest of all for writing?

> If you have a battery backed cache you might not need even that.
> Just throwing them all into a big raid might work just as well.

Any ideas on how to test this before we install the database?
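[Editorial note: one low-tech way to compare candidate layouts before the database exists is to time synced sequential writes onto each LUN with dd. A rough sketch -- TARGET is a placeholder for the real mount point (a temp directory stands in so the sketch is runnable), and conv=fdatasync assumes GNU dd:]

```shell
# Write 64 MB with a data sync at the end and note the MB/s dd reports;
# point TARGET at the volume under test (/wal, /data, ...).
TARGET=${TARGET:-$(mktemp -d)}
dd if=/dev/zero of="$TARGET/ddtest" bs=8k count=8192 conv=fdatasync
```

Run it once per candidate LUN and compare the throughput figures; once PostgreSQL is installed, pgbench gives a more realistic mixed workload.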
In my sporadic benchmark testing, the only consistent "trick" I found was that the best thing I could do for sequential performance was allocating a bunch of mirrored-pair LUNs and striping them with software RAID. This made a huge difference (~2X) in sequential performance, and gave a little boost in random I/O -- at least in FLARE 19. On our CX-500s, FLARE failed to fully utilize the secondary drives in RAID 1+0 configurations. FWIW, after several months of inquiries, EMC eventually explained that this is due to their desire to ease the usage, and thus the wear, on the secondaries, in order to reduce the likelihood of both drives in a mirrored pair failing.

We've never observed a difference using separate WAL LUNs -- presumably due to the write cache. That said, we continue to use them, figuring it's "cheap" insurance against running out of space, as well as against performance problems under conditions we didn't see while testing. We also ended up using single large LUNs for data, but I must admit I wanted more time to benchmark splitting off heavily hit tables.

My advice would be to read the EMC performance white papers, remain skeptical, and then test everything yourself. :D

On Wed, 2007-07-11 at 09:03 -0400, Dave Cramer wrote:
> Assuming we have 24 73G drives is it better to make one big metalun
> and carve it up and let the SAN manage where everything is, or is
> it better to specify which spindles are where.
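[Editorial note: the host-side striping described above can be sketched with Linux mdadm. The device names are placeholders for however the mirrored-pair LUNs appear on the host (emcpower* is the PowerPath naming convention), and the chunk size is a guess worth benchmarking:]

```shell
# Each /dev/emcpower* device is assumed to be a RAID-1 (mirrored pair)
# LUN exported by the array; mdadm stripes (RAID-0) across them.
mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 \
      /dev/emcpowera /dev/emcpowerb /dev/emcpowerc /dev/emcpowerd

# Filesystem choice is up to you; ext3 was typical at the time
mkfs -t ext3 /dev/md0
mount /dev/md0 /data
```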
On Wed, Jul 11, 2007 at 09:03:27AM -0400, Dave Cramer wrote:
> Problem with dedicating the spindles to each array is that we end up
> wasting space. Are the SAN's smart enough to do a better job if I
> create one large metalun and cut it up ?

In my experience, this largely depends on your SAN and its hardware and firmware, as well as its ability to interact with the OS. I think the best answer is "sometimes yes".

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
However important originality may be in some fields, restraint and
adherence to procedure emerge as the more significant virtues in a
great many others.
		--Alain de Botton
pg@fastcrypt.com (Dave Cramer) writes:
> This is quite unexpected. Since the WAL is primarily all writes,
> isn't a RAID 1 the slowest of all for writing ?

The thing is, the disk array caches this LIKE CRAZY. I'm not quite sure how many batteries are in there to back things up; there seem to be multiple levels of such, which means that as far as fsync() is concerned, the data is committed very quickly even if it takes a while to physically hit disk.

One piece of the controversy is that the disks used for WAL are certain to be written to as heavily and continuously as your load dictates. A fallout of this is that those disks are likely to be worked harder than the disks used for storing "plain old data," with the result that if you devote disks to WAL, you'll likely burn through replacement drives faster there than you do for the "POD" disks.

It is not certain whether it is more desirable to:
a) Spread that wear and tear across the whole array, or
b) Target certain disks for that wear and tear, and expect to need to
   replace them somewhat more frequently.

At some point, I'd like to run a test on a decent disk array with multiple configurations. Assuming 24 drives:
- Use all 24 to make "one big filesystem" as the base case
- Split off a set (6?) for WAL
- Split off a set (6? 9?) to hold a second tablespace, and shift indices there

My suspicion is that the "use all 24 for one big filesystem" scenario is likely to be fastest by some small margin, and that the other cases will lose a very little bit in comparison. Andrew Sullivan had a somewhat similar finding a few years ago on some old Solaris hardware that unfortunately isn't at all relevant today: he basically found that moving WAL off to a separate disk didn't affect performance materially.

What's quite regrettable is that it is almost sure to be difficult to construct a test that, on a well-appointed modern disk array, won't basically stay in cache.
--
let name="cbbrowne" and tld="acm.org" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/nonrdbms.html
16-inch Rotary Debugger: A highly effective tool for locating problems
in computer software.  Available for delivery in most major
metropolitan areas.  Anchovies contribute to poor coding style.
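[Editorial note: the third configuration above (indices in a second tablespace) can be scripted against an 8.x cluster roughly as follows; the database name, tablespace name, path, and index name are all placeholders:]

```shell
psql -d mydb <<'SQL'
-- Tablespace on the volume set aside for indexes (path is a placeholder)
CREATE TABLESPACE idxspace LOCATION '/idx/pgidx';

-- Shift an existing index onto it; repeat for each index under test
ALTER INDEX my_table_pkey SET TABLESPACE idxspace;
SQL
```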
On Jul 11, 2007, at 12:39 PM, Chris Browne wrote:
> - Split off a set (6?) for WAL

In my limited testing, 6 drives for WAL would be complete overkill in almost any case. The only example I've ever seen where WAL was able to swamp 2 drives was the DBT testing that Mark Wong was doing at OSDL, and the only reason that was the case is that he had somewhere around 70 data drives.

I suppose an entirely in-memory database might be able to swamp a 2-drive WAL as well.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
On Wed, Jul 11, 2007 at 01:39:39PM -0400, Chris Browne wrote:
> ... with the result that if you devote disk to WAL, you'll likely
> burn thru replacement drives faster there than you do for the
> "POD" disk.

This is true, and in operation it can really burn you when you start to blow out disks. In particular, remember to factor the cost of RAID rebuild into your RAID plans, because you're going to be doing it, and if your WAL is near its I/O limits, the only way you're going to get your redundancy back is to go noticeably slower :-(

> He basically found that moving WAL off to separate disk didn't
> affect performance materially.

Right, but it's not only the hardware that isn't relevant there. It was also using either 7.1 or 7.2, which means that the I/O pattern was completely different. More recently, ISTR, we did analysis for at least one workload that told us to use separate LUNs for WAL, with separate I/O paths. This was with at least one kind of array supported by Awful Inda eXtreme. Other tests, IIRC, came out differently -- the experience with one largish EMC array was, I think, a dead heat between the various strategies (so the additional flexibility of doing everything on the array was worth any cost we were able to measure). But the last time I had to be responsible for that sort of test was again a couple of years ago.

On the whole, though, my feeling is that you can't make general recommendations on this topic: the advances in storage are happening too fast to make generalisations, particularly in the top classes of hardware.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
The plural of anecdote is not data.
		--Roger Brinner
On Wed, 11 Jul 2007, Jim Nasby wrote:
> I suppose an entirely in-memory database might be able to swamp a 2
> drive WAL as well.

You can really generate a whole lot of WAL volume on an EMC SAN if you're doing UPDATEs fast enough on data that is mostly in-memory. It takes a fairly specific type of application to do that, though, and whether you'll ever find it outside of a benchmark is hard to say.

The main thing I would add as a consideration here is that you can configure PostgreSQL to write WAL data using the O_DIRECT path, bypassing the OS buffer cache, and greatly improve performance into SAN-grade hardware like this. That can be a big win if you're doing writes that dirty lots of WAL, and the benefit is straightforward to measure if the WAL is a dedicated section of disk (just change the wal_sync_method and benchmark with each setting). If the WAL is just another section on an array, how well those synchronous writes will mesh with the rest of the activity on the system is not as straightforward to predict. Having the WAL split out provides a logical separation that makes figuring all this out easier.

Just to throw out a slightly different spin on the suggestions going by here: consider keeping the WAL separate, starting as a RAID-1 volume, but keep 2 disks in reserve so that you could easily upgrade to a 0+1 set if you end up discovering you need to double the write bandwidth. Since there's never much actual data on the WAL disks, that would be a fairly short downtime operation. If you don't hit a wall, the extra drives might serve as spares to help mitigate concerns about the WAL drives burning out faster than average because of their high write volume.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
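[Editorial note: a sketch of the benchmark loop Greg describes. The open_sync/open_datasync methods are the ones that use O_DIRECT for WAL on platforms that support it (since 8.1); not every method is available everywhere, and the data directory, database name, and pgbench client/transaction counts are placeholders you would tune:]

```shell
# Restart with each candidate WAL sync method and compare pgbench
# throughput against the dedicated WAL volume.
for method in fdatasync fsync open_datasync open_sync; do
    pg_ctl -D /data/pgdata -o "-c wal_sync_method=$method" -w restart
    echo "=== wal_sync_method=$method ==="
    pgbench -c 8 -t 10000 mydb
done
```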
On 11-Jul-07, at 2:35 PM, Greg Smith wrote:
> You can really generate a whole lot of WAL volume on an EMC SAN if
> you're doing UPDATEs fast enough on data that is mostly in-memory.
> Takes a fairly specific type of application to do that though, and
> whether you'll ever find it outside of a benchmark is hard to say.

Well, this is such an application. The db fits entirely in memory, and the site is doing over 12M page views a day (I'm not exactly sure what this translates to in transactions).

> Just to throw out a slightly different spin on the suggestions
> going by here: consider keeping the WAL separate, starting as a
> RAID-1 volume, but keep 2 disks in reserve so that you could easily
> upgrade to a 0+1 set if you end up discovering you need to double
> the write bandwidth.

Dave
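[Editorial note: for a back-of-the-envelope conversion of that traffic figure, 12M page views per day averages out to roughly 139 page views per second; peaks will be several times that, and each page view may issue multiple transactions:]

```shell
# Average page views per second from 12M/day; the peak multiplier and
# queries-per-page factor are unknowns, so treat this as a floor.
awk 'BEGIN { printf "%.0f page views/sec average\n", 12000000 / 86400 }'
```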