Oracle Scratchpad

February 5, 2007

Go-faster stripes

Filed under: Infrastructure, Performance, Tuning — Jonathan Lewis @ 10:55 am GMT Feb 5,2007

A few days ago on the Oracle-L listserver, someone raised the topic of SANs and disk performance in a question about Oracle’s SAME (stripe and mirror everything) strategy.

In outline, their system had an EMC box with 120 disks of 256GB configured as RAID 1+0 (mirrored and striped). Each disk was sliced into 8GB “hypervolumes” (EMC-speak), and sets of hypervolumes had been grouped together into “metavolumes” (EMC-speak again), which were made visible to the operating system as logical devices.

The poster did not state explicitly whether the hypervolumes were simply concatenated to form the metavolumes or whether EMC striping (which is done at 960KB) had been used to generate them, nor was there an explicit statement of how many hypers went into a single meta.
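
Just to put the description in perspective, here is a quick back-of-envelope sketch (Python, arithmetic only) of what that configuration amounts to. The inputs simply restate the poster’s figures; nothing here is measured:

    # Back-of-envelope arithmetic for the configuration described above.
    # All inputs restate the poster's figures; nothing here is measured.

    disks           = 120        # physical spindles in the EMC box
    disk_gb         = 256        # raw capacity per disk
    hyper_gb        = 8          # size of each "hypervolume" slice

    hypers_per_disk = disk_gb // hyper_gb        # 32 slices per spindle
    total_hypers    = disks * hypers_per_disk    # 3,840 hypervolumes to keep track of
    mirrored_pairs  = disks // 2                 # RAID 1+0: 60 mirrored pairs
    usable_gb       = mirrored_pairs * disk_gb   # capacity visible to the hosts

    print(f"{hypers_per_disk} hypers per disk, {total_hypers} hypers in total")
    print(f"{mirrored_pairs} mirrored pairs, about {usable_gb / 1024:.1f} TB usable")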

In the words of the poster, though:

“The EMC box is used by over 10 databases, from OLTP to DWH and some of them Hybrid configuration. The problem with our disks is that sometimes we get like 80ms response time and most of time between 25ms and 60ms”.

One of the most interesting responses to this post came from Kevin Closson, who referred to a recent review he had written of an Oracle TPC-H benchmark. How does Oracle Corp. make data warehouses run fast? Easy: use 3,072 disks of 36GB each, and then leave them 40% empty.
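
The arithmetic behind that observation is worth spelling out: the benchmark configuration is presumably buying spindles (I/O capacity), not storage. A rough sketch, restating only the figures quoted above:

    # Rough arithmetic for the TPC-H observation: 3,072 disks of 36GB,
    # left about 40% empty.  Figures restate the post; nothing else is implied.

    disks       = 3072
    disk_gb     = 36
    fill_factor = 0.60                      # roughly 40% of each disk left empty

    raw_tb  = disks * disk_gb / 1024
    used_tb = raw_tb * fill_factor

    print(f"raw capacity : ~{raw_tb:.0f} TB spread over {disks} spindles")
    print(f"space in use : ~{used_tb:.0f} TB; the rest is headroom that keeps seeks short")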

Another response, however, did such a good job of summing up the discussion of SANs and how to use/abuse them that I asked the author if I could quote his email in its entirety. Here, from Mark Farnham, is what I would like to call the thinking DBA’s guide to SAN technology.

Subject: RE: How many of you use S.A.M.E?

Okay, so there are whole books and many papers good, bad, and ugly on this topic.

Grossly oversimplifying, and remembering that cache mitigates the downside of I/O demand collisions, SAME operates like a statmux (statistical multiplexer): every requesting process sticks its straw into a single firehose (or garden hose if you’re unlucky) and drinks and blows bubbles in competition with all the other processes and their straws.

I think it was JL who remarked emphatically that he would prefer that if someone else running a database on the same disk farm as him wanted to destroy their own performance, that was okay with him but he would prefer that they could not destroy his performance. Whether that is parceled out as different tablespaces isolated from each other within a single database or multiple databases doesn’t matter much for the central bit I’m trying to convey. SAME avoids hot spots and tends to even out the load and that by definition means if one user is making the disk farm go crazy everyone suffers equally. That is neither all good nor all bad.

Let’s say you have three databases designed to serve EMEA (Europe, the Middle East, and Africa), AMER (the Americas region, you know, from north of Canada all the way south to that bit that almost reaches Antarctica), and ASIA. If those are peak-load oriented to 9AM to 5PM in the local time zones and you smear everything across all the disks evenly like SAME, then you effectively get triple the I/O horsepower for each when you need it. That is the polar case where SAME shines best.
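
[ed: a toy model of the time-zone argument; the per-disk I/O rate is purely illustrative and the point is the ratio, not the absolute numbers:]

    # Toy model of the time-zone example: three regional databases whose
    # peak windows do not overlap.  IOPS_PER_DISK is illustrative only.

    DISKS_TOTAL   = 120
    IOPS_PER_DISK = 150           # assumed nominal random-I/O rate per spindle

    # Dedicated layout: each region owns a third of the spindles.
    dedicated_per_region = (DISKS_TOTAL // 3) * IOPS_PER_DISK

    # SAME: everything is striped across all spindles, so whichever region
    # is at its peak (while the others idle) can draw on the whole farm.
    same_per_region = DISKS_TOTAL * IOPS_PER_DISK

    print(f"dedicated third of the farm: ~{dedicated_per_region} IOPS at peak")
    print(f"SAME across the whole farm : ~{same_per_region} IOPS at peak (triple)")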

Now let’s say you have three applications that don’t share data between them but which simultaneously peak in activity (meaning I/O demand in this case). SAME will minimize hot spots, but it will also maximize seek, read, and write collisions. (I guess DBWR will mitigate the write collisions somewhat, especially if you have segregated the redo destinations from the database files [ignoring SAME in that case].)

What if two of the applications are beating the disk drives to death with batch jobs and one of the applications is trying to service interactive user requests? You lose. SAME applies the pain equally to everyone.

Now I’m not sure what became of a paper by Gary Sharpe that I helped write, but it had the neatest pictures of a big disk farm and how it could quickly become incomprehensible for humans to make good choices (as in [the original poster’s] case of 120 disks with 32 slices each) in the assembly of volumes for Oracle (or anything else) to use. By the way, I’m looking for that paper if anyone has a copy with the animated PowerPoint. I suppose I could redo the work, but that thing is a piece of art and I wouldn’t do it justice.

We introduced the concept of “stripe sets”: if you take some number of those 120 disks, line them up, and paint a different color across all the disks on each of those 32 slices, you would be looking at 32 stripes and one stripe set. Which disks, and how many disks per stripe set, is something you have to determine for a particular disk farm, taking into account all the things that queue on a disk request, redundancy, the most efficient pairwise hardware mirroring of the drives if that is possible, etc. etc. etc. So if you look at the picture of the whole disk farm and you want to parcel out storage to different applications or databases, it is child’s play, almost boring, to allocate a good solution that makes sense.

In general, though, when you add storage the minimum increment tends to be a whole tray full of disks (because you want to clone your stripe-set definitions for ease of use, and if you just stick one drive in instead and add a little piece of it to each stripeset-based volume to grow the volume you will immediately produce a hot spot so intense that it has been known to destroy new drives well before their normal mean time between failure). SAME has a protocol for adding single drives, and ASM automates blending in additional storage over time.
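
[ed: a minimal sketch of the stripe-set bookkeeping described above, in Python; the set sizes and the allocation are hypothetical, and the regional names just echo the earlier example:]

    # Group whole spindles into stripe sets, then hand whole stripe sets to
    # databases.  Counts and names are hypothetical.

    DISKS_PER_STRIPE_SET = 10
    TOTAL_DISKS          = 120

    stripe_sets = {
        f"SS{n:02d}": [f"disk{n * DISKS_PER_STRIPE_SET + i:03d}"
                       for i in range(DISKS_PER_STRIPE_SET)]
        for n in range(TOTAL_DISKS // DISKS_PER_STRIPE_SET)
    }

    # Allocation then means assigning whole stripe sets, never individual
    # slices, to each database.
    allocation = {
        "EMEA": ["SS00", "SS01", "SS02"],
        "AMER": ["SS03", "SS04", "SS05"],
        "ASIA": ["SS06", "SS07", "SS08"],
        # SS09 to SS11 left free for growth or a new application
    }

    for db, sets in allocation.items():
        spindles = [d for s in sets for d in stripe_sets[s]]
        print(f"{db}: {len(sets)} stripe sets, {len(spindles)} dedicated spindles")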

It is entirely possible to arrange the Meta Devices to be stripes of a stripeset and then to allocate the Meta Devices from a given stripeset to only one database. This is part of the BORING protocol. You can implement it with disk groups in ASM. If isolation of disk I/O demand is what you want, that is as good a way to do it as any, either with ASM or by hand. For the disk farm interfaces I am aware of, you have to do the bookkeeping to keep track of which [meta devices, volumes, plexes, plex sets, make up your own new name] are which and which disks comprise them. Using consistent nomenclature can automatically create a virtual notebook, but you have to remember that the volume managers are not going to enforce your nomenclature against your typos.
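
[ed: the “virtual notebook” point deserves emphasis. A trivial sketch of the kind of check that catches the typos a volume manager won’t; the naming pattern and device names are entirely hypothetical:]

    import re

    # Encode database and stripe set in each metavolume name, then verify the
    # names: the volume manager will not enforce the convention for you.
    # The naming pattern and device names are hypothetical.

    NAME_PATTERN = re.compile(r"^(?P<db>[A-Z]{3,8})_SS(?P<ss>\d{2})_MV(?P<mv>\d{3})$")

    meta_devices = [
        "EMEA_SS01_MV001",
        "EMEA_SS01_MV002",
        "AMER_SS04_MV001",
        "AMER_S04_MV002",       # typo: the convention says SS, not S
    ]

    for name in meta_devices:
        m = NAME_PATTERN.match(name)
        if m:
            print(f"{name}: database {m.group('db')}, stripe set {m.group('ss')}")
        else:
            print(f"{name}: does NOT match the naming convention")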

Arranging things in this BORING way is also conducive to producing good thinking about adding faster media types to an existing disk farm. Oh, BORING is Balanced Organization of Resources in Natural Groups. So if you add some 15Krpm, 256M-cache drives to a farm that was previously made of 7.2Krpm, 4M-cache drives, don’t mix them up in existing stripe sets. Likewise if you add some mirrored (aka duplexed) persistent RAM disk devices. Make them be separate stripesets, and separate disk groups if you’re using ASM.

So you still stripe and mirror everything. Just not all in one piece. And to the extent you are able to divide the use of channels, cache, and disk platters, you will isolate (protect) the response time of one application from high use by other applications. Isolating cache usage runs from easy to impossible depending on what the disk array supports. Interestingly enough, if you cannot partition cache and your I/O demand underflows what the cache is capable of, then after warmup any old SAME and a perfectly arranged BORING layout will perform the same. (You also won’t be asking the question below [ed: about response time] if your load demand underflows cache capability.)
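
[ed: a toy model of the cache “underflow” point, with purely illustrative latencies; the interesting bit is that the two layouts converge as the cache-hit ratio approaches 1:]

    # Toy response-time model: once the working set fits in array cache,
    # the layout underneath stops mattering.  Latencies are illustrative.

    CACHE_HIT_MS   = 1.0        # assumed array-cache service time
    DISK_MS_SHARED = 12.0       # assumed disk service time on a busy shared farm
    DISK_MS_BORING = 8.0        # assumed disk service time on isolated stripe sets

    def avg_response(hit_ratio, disk_ms):
        """Average I/O time for a given cache-hit ratio and disk service time."""
        return hit_ratio * CACHE_HIT_MS + (1 - hit_ratio) * disk_ms

    for hit in (0.50, 0.90, 0.99, 1.00):
        print(f"hit ratio {hit:.2f}:"
              f" shared {avg_response(hit, DISK_MS_SHARED):5.2f} ms,"
              f" isolated {avg_response(hit, DISK_MS_BORING):5.2f} ms")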

Now, lest someone think I am being unfair to SAME, remember that if you don’t actually have a disk performance problem, then some variety of SAME is probably the cheapest to configure and maintain.

Also, notice that in the time-zone peak-load case, with BORING you have less total disk throughput to serve each time zone while the disks for the other time zones sit nearly idle. Of course that might be a good time to run batch jobs and back up the other time zones, but SAME would have made all the horsepower available to each time zone.

BORING takes a bit more configuration, or a lot more, depending on the technology and tools you have. If you have no idea what the load demands of the different applications or databases will be, then you don’t really have a basis for configuring BORING for an immediate performance advantage, but if you keep it really boring it will be easy to reallocate. There was a time when it seemed like the vendors assembled the disk farms in the worst possible way at the factory and then you had to pay extra for tool suites to rebuild them sensibly, but I must have been imagining that (which I write for legal purposes).

SAME and BORING each have strong points and best use cases. What you seem to indicate you have may be what I call “HAPHAZARD”, for which I have no spelled-out words. Autogenerated HAPHAZARD may be okay as long as you never have to look at it and understand it. And you might not have to look at it, except that you seem to think you are having I/O problems, so I guess you do have to look at it.

Finally, if perchance you acquire a disk farm that is divided 50/50 between test and production, so that your load simulations in test will be “just like the real thing”, make very certain you understand which way it was cut in half before you let someone start a load test after production is in service. If they allocated half the disks and half the channels, etc. to each, you’ll be fine. If they cut each disk platter in half by definition of partitions… you likely won’t be fine.

Regards,

mwf

The only thing I think I could add to that is a reference to James Morle’s article on Sane SAN.

3 Comments »

  1. Nice. I just forwarded this to an ex-colleague with a note that, even if you know all this stuff already, it expresses it well when you want to explain it to someone else. I’ll be keeping this link :-)

    Comment by Doug Burns — February 6, 2007 @ 3:09 am GMT Feb 6,2007

  2. […] There is an excellent piece here about some of the choices to be made in configuring SANs for your application. The key point is […]

    Pingback by Magic SANs « Inside Documentum — February 7, 2007 @ 10:31 am GMT Feb 7,2007

  3. Silly me. The BORING acronym I quoted in the note was the beta release. After a contest with a few of my friends in January 2002, “Butt-kicking Organization of Rust in Groups” was disqualified for not accounting for SSD and the winner was announced as “Best Optimization of Resources in Natural Groups.”

    Best trumped Balanced, for the same reasons in the note.

    Optimization trumped Organization, since you could certainly be poorly organized.

    Not that I care so much about words in the acronym as long as it brings forth the useful shorthand for the discussion. After all, People Can’t Memorize Computer Industry Acronyms (author unknown).

    Comment by Mark W. Farnham — March 3, 2007 @ 2:06 am GMT Mar 3,2007



Comments and related questions are welcome.
