Oracle Scratchpad

October 17, 2014

sreadtim

Filed under: Oracle,Statistics,System Stats,Troubleshooting — Jonathan Lewis @ 1:22 pm GMT Oct 17,2014

Here’s a question that appeared in my email a few days ago:

Based on the formula: “sreadtim = ioseektim + db_block_size/iotrfrspeed”, sreadtim should always bigger than ioseektim.

But I just did a query on my system, find it otherwise, get confused:

SQL> SELECT * FROM SYS.AUX_STATS$;<

SNAME                          PNAME                               PVAL1 PVAL2
------------------------------ ------------------------------ ---------- --------------------
SYSSTATS_INFO                  STATUS                                    COMPLETED
SYSSTATS_INFO                  DSTART                                    10-08-2014 10:45
SYSSTATS_INFO                  DSTOP                                     10-10-2014 10:42
SYSSTATS_INFO                  FLAGS                                   1
SYSSTATS_MAIN                  CPUSPEEDNW                     680.062427
SYSSTATS_MAIN                  IOSEEKTIM                              10
SYSSTATS_MAIN                  IOTFRSPEED                           4096
SYSSTATS_MAIN                  SREADTIM                            4.716
SYSSTATS_MAIN                  MREADTIM                            2.055
SYSSTATS_MAIN                  CPUSPEED                             1077
SYSSTATS_MAIN                  MBRC                                    4
SYSSTATS_MAIN                  MAXTHR                          956634112
SYSSTATS_MAIN                  SLAVETHR                           252928

How do we explain this ?

This question highlights two points – one important, the other only slightly less so.

The really important point is one of interpretation.  Broadly speaking we could reasonably say that the (typical) time required to perform a single block read is made up of the (typical) seek time plus the transfer time which, using the names of the statistics above, would indeed give us the relationship: sreadtim = ioseektim + db_block_size/iotfrspeed; but we have to remember that we are thinking of a simplified model of the world. The values that we capture for sreadtim include the time it takes for a request to get from Oracle to the O/S, through the various network software and hardware layers and back again; the formula ignores those components completely and, moreover, doesn’t allow for the fact that some “reads” could actually come from one of several caches between Oracle and the disc without any physical disc access actually taking place. Similarly we should be aware that the time for an actual I/O seek would vary dramatically with the current position  of the read head, the radial position of the target block, the speed and current direction of movement of the read head, and the rotational distance to the target block. The formula is not attempting to express a physical law, it is simply expressing an approximation that we might use in a first line estimate of performance.

In fact we can see in the figures above that multi-block reads (typically of 4 blocks)  were faster than single block reads on this hardware for the duration of the sampling period – and that clearly doesn’t fit with the simple view embedded in our formula of how disc drives work.  (It’s a fairly typical effect of SANs, of course, that large read requests make the SAN software start doing predictive read-ahead, so that the next read request from Oracle may find that the SAN has already loaded the required data into its cache.)

There is, however, the second point that these figures highlight – but you have to be in the know to spot the detail: whatever the complexities introduced by SAN caching, we’re not comparing the right numbers. The ioseektim and iotfrspeed shown here are the default values used by Oracle. It looks as if the user has called dbms_stats.gather_system_stats() with a 48 hour workload (dstart = 8th Oct, dstop = 10th Oct) but hasn’t yet executed the procedure using the ‘noworkload’ option. Perhaps the ioseektim and iotfrspeed figures from a noworkload call would look a little more reasonable when compared with the 4.716 milliseconds of the gathered sreadtim.

There may still be a large gap between the model and the reality, but until the two sets of figures we’re using come from the same place we shouldn’t be comparing them.

 

13 Comments »

  1. I have an SR open for almost 2 weeks with exactly the same question.
    Recently they asked to do select kcfiosbr,kcfiosbt from x$kcfio and changed SR status back to Review Defect.
    I would expect sreadtim/mreadtim to be in line with ORION results, but it is way different from that.

    Comment by Ruslan — October 17, 2014 @ 5:14 pm GMT Oct 17,2014 | Reply

    • I use to set sreadtim/mreadtim manually, according to ORION/SLOB or other IO workload generators. Also setting MBRC to reasonable value is very important as well. I don’t like automatic setting of “db_file_multiblcok_read_count” (it still sometimes show strange values), manual setting to 1MB is always good value on enterprise SAN as well (and on Engineered system is no exception). then MBRC can be used to adjust slightly the preference of sequential/random reads. However, I suggest running also only no workload (or special EXADATA) statistics for most DBAs.

      Regards
      Pavol Babel

      Comment by Pavol Babel — October 17, 2014 @ 11:47 pm GMT Oct 17,2014 | Reply

    • Ruslan,

      the problem with mreadtim/sreadtim is that they reflect EVERY read request that the instance has done, and don’t allow for the fact that many of those reads may be local or SAN cache, whereas ORION is engineered to try to get more realistic times by doing things that aren’t likely to lead to accidental caching and pre-fetch effects. Would I be right in guessing that your ORION results were much slower than the sreadtim and mreadtim promised ?

      Comment by Jonathan Lewis — October 20, 2014 @ 9:48 am GMT Oct 20,2014 | Reply

      • Hi Jonathan,

        >> don’t allow for the fact that many of those reads may be local or SAN cache
        Both ORION and ASM get some of their results from SAN cache.
        ORION was designed to simulate ASM disk access patterns so I guess it should have very similar performance results.

        ORION indeed does take into effect SAN cache – by default it does two minutes of of random unmeasured reads
        to “warm” the cache:
        https://docs.oracle.com/cd/E18283_01/server.112/e16638/iodesign.htm#sthref576
        So it might have some effects.

        What do you mean by local cache? ASM works with disks directly so there are no other caches except SAN cache.
        There are some exotic options like ASM on NFS and it might have some local cache effects but we don’t use that :)

        There are some good news on SR 3-9701138511 – System Statistics: Sreadtim And Mreadtim Are Less Than Ioseektim
        Bug 19824941 – System Statistics: Sreadtim And Mreadtim Are Less Than Ioseektim
        patch 13455010 (not published)
        I heard it might be part of 12.1.0.2.4 PSU, not back-portable to 11g.

        Thank you!

        Comment by Dautkhanov — April 30, 2015 @ 6:12 pm GMT Apr 30,2015 | Reply

        • Dautkhanov,

          Orion certainly can be told to warm the cache – but it does it through “large random reads”, and I’m always pessimistic about “sensibly engineered” tests not being a very good indicator of insanely performance applications (please assume that goes with a small smile). I’ll agree with “some effects” – but I wouldn’t ever be surprised to see mreadtim values from a live system that were much faster than those produced after Orion’s warming efforts.

          Local cache: Orion can be run against files on a file system – hence comments about local cache. Your comment reads as if I’m supposed to know about your specific hardware – but I don’t think I do (and I don’t have access to any SRs other than my own, and those granted to me by client companies).

          Nice to know that there’s likely to be a patch for the bug. I don’t think that bug is really related to the original question, though; it’s more about comparing a statement which ought to be true in theory with a set of results that isn’t true in practive (with the added confusion that some of the actual results are just the (1970’s) defaults rather than any collected value).

          Comment by Jonathan Lewis — May 5, 2015 @ 5:27 pm GMT May 5,2015

  2. Hi Jonathan,

    >> In fact we can see in the figures above that multi-block reads (typically of 4 blocks) were faster than single block reads on this hardware for the duration of the sampling period – and that clearly doesn’t fit with the simple view embedded in our formula of how disc drives work.

    I think it is worth to mention, that Oracle introduced a sanity check for system statistics to cover these kind of issues. The cost based optimizer recalculates the values of sreadtim and mreadtim, if the value of mreadtim is not available or if it is less than sreadtim. So in his case the sreadtim (and mreadtim) is recalculated anyway regardless the workload statistics.

    sreadtim ≈ ioseektim + (db_block_size / iotfrspeed)
    mreadtim ≈ ioseektim + (mbrc * db_block_size / iotfrspeed)

    This sanity check is also explained in detail on page 204 in book “Troubleshooting Oracle Performance – 2nd Edition” by Christian Antognini and the following test case here: http://antognini.ch/downloads/top2/chapter07/system_stats_sanity_checks.sql

    Best Regards
    Stefan

    Comment by Stefan Koehler — October 18, 2014 @ 11:45 am GMT Oct 18,2014 | Reply

    • Stefan,

      Thanks for the comment. The behaviour varies with version of Oracle, of course, but Christian probably made that point too; after all his book is probably the best on the market for understanding and resolving performance problems on an Oracle database.

      Comment by Jonathan Lewis — October 20, 2014 @ 9:42 am GMT Oct 20,2014 | Reply

  3. Looks like a bug was opened for this: Bug 19824941 : LOW VALUES FOR SREADTIM AND MREADTIM

    Comment by jkstill — October 22, 2014 @ 1:09 pm GMT Oct 22,2014 | Reply

    • Looking at the numbers (9 micro seconds for a single block read / 696MB per second max throughput) I wonder if that’s a consequence of reads optimized through flash cache. I haven’t looked closely at how Oracle adjusts for the sreadtim and mreadtim when (lots of) reads come from flash.

      Comment by Jonathan Lewis — October 22, 2014 @ 2:01 pm GMT Oct 22,2014 | Reply

      • That’s the one. 9 microseconds isn’t achievable even with modern all-flash disks. And that database is using spinning disks. Yes, there is a storage cache but even with storage cache it could not return data in 9 microseconds.

        ORION shows a histogram of latencies, and there always two peaks – one when data was read from storage cache (that is around 500-1000 us, ~0.7ms). Another peak is when data is read physically from disks (averages around ~9ms).

        Latency: # of IOs (read) # of IOs (write)
        32 – 64 us: 0 0
        64 – 128 us: 0 0
        128 – 256 us: 8 0
        256 – 512 us: 739 0
        512 – 1024 us: 14680 0 — first peak – reads from storage cache
        1024 – 2048 us: 4562 0
        2048 – 4096 us: 12382 0
        4096 – 8192 us: 96822 0
        8192 – 16384 us: 103558 0 — second peak – physical reads
        16384 – 32768 us: 12558 0
        32768 – 65536 us: 2310 0
        65536 – 131072 us: 172 0
        131072 – 262144 us: 36 0
        262144 – 524288 us: 0 0
        524288 – 1048576 us: 0 0

        Not sure where Oracle system stats got 9 micro seconds. I support what Pavol wrote above – to set system stats according to orion results.

        Comment by dautkhanov — October 22, 2014 @ 5:16 pm GMT Oct 22,2014 | Reply

  4. […] a multiblock read will be quite small (eight blocks) and a lot slower than a single block read (26 ms vs. 12 ms). One simple change that many people might have made during or shortly after installation (though […]

    Pingback by Index Usage | Oracle Scratchpad — March 29, 2016 @ 10:53 am GMT Mar 29,2016 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.