Oracle Scratchpad

September 22, 2011

Flash Cache

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 5:00 pm BST Sep 22,2011

Have you ever heard the suggestion that if you see time lost on the event write complete waits you need to get some faster discs?
So what’s the next move when you’ve got 96GB of flash cache plugged into your server (check the parameters below) and still see time lost on the event write complete waits: flash cache?

db_flash_cache_file           /flash/oracle/flash.dat
db_flash_cache_size           96636764160

Here’s an extract from a standard 11.2.0.2 AWR report:

                           Total
Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
write complete waits: flas    32                     3.1              21.9  75.0

                           Waits
                           64ms
Event                      to 2s <32ms <64ms <1/8s <1/4s <1/2s   <1s   <2s  >=2s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
write complete waits: flas    11   3.1   3.1   3.1   3.1   6.3   6.3  12.5  62.5

                           Waits
                            4s
Event                      to 2m   <2s   <4s   <8s  <16s  <32s  < 1m  < 2m  >=2m
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
write complete waits: flas    20  37.5  18.8   9.4  12.5   9.4   9.4   3.1

It’s interesting to see the figures for single and multiblock reads from flash cache. The hardware is pretty good at single block reads – but there’s a strange pattern to the multiblock read times. The first set of figures is from the Top N section of the AWR, the second set is from the event histogram sections (the 11.2 versions are more informative than the 11.1 and 10.2 – even though the arithmetic seems a little odd at the edges). Given the number of reads from flash cache in the hour, the tiny number of write waits isn’t something I’m going to worry about just yet – my plan is to get rid of a couple of million flash reads first. (Most of the read by other session waits are waiting on the flash cache read as well – so I’ll be aiming at two birds with one stone.)

                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
db flash cache single block ph    3,675,650       5,398      1   35.2 User I/O
DB CPU                                            4,446          29.0
read by other session             1,092,573       1,407      1    9.2 User I/O
direct path read                      6,841       1,371    200    8.9 User I/O
db file sequential read             457,099       1,046      2    6.8 User I/O

                                                    % of Waits
                                 -----------------------------------------------
                           Total
Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
db flash cache multiblock  22.1K   7.6  12.3   7.7   6.0   8.8  24.7  32.8
db flash cache single bloc 3683K  66.6  22.6   3.6   2.4   3.9    .8    .0

                           Waits
                           64ms
Event                      to 2s <32ms <64ms <1/8s <1/4s <1/2s   <1s   <2s  >=2s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
db flash cache multiblock   7255  67.2  21.5  10.4    .9    .0
db flash cache single bloc  1587 100.0    .0    .0    .0    .0    .0

Answers to the following question on a postcard, please: Why do we get a “double hump” on the distribution of multiblock reads?

10 Comments »

  1. Was this using an ASM disk group for the flash cache and if so, was a block size specified for the group? Those multi-block read times are horrendous.

    Comment by Stephen Butterworth — September 22, 2011 @ 5:31 pm BST Sep 22,2011 | Reply

  2. 2ms to erase, larger times to write?

    Comment by jgarry — September 22, 2011 @ 8:06 pm BST Sep 22,2011 | Reply

  3. Most SSDs (on which Flash Cache should be based) suffer from what is termed “Write Amplification”, which can, depending on the underlying SSD technology and algorithms, introduce spikes in write performance. More details in http://en.wikipedia.org/wiki/Write_amplification but, in short, writes to the SSD will require “garbage collection” to free up blocks in the SSD array before the writes can complete, and this can introduce significant pauses depending on the amount of writes and the rate at which they arrive into Flash Cache.

    Comment by jkanagaraj — September 22, 2011 @ 9:36 pm BST Sep 22,2011 | Reply

  4. Most SSDs (on which Flash Cache should be based) suffer from what is termed “Write Amplification”, which can, depending on the underlying SSD technology and algorithms, introduce spikes in write performance. More details in http://en.wikipedia.org/wiki/Write_amplification but, from what I understand, the SSD does not overwrite existing data but writes into free blocks, and as a result has to perform periodic “garbage collection” to free up such space. This can result in random spikes depending on the amount and rate of writes.
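
     To put a rough number on it (hypothetical figures, just to illustrate the arithmetic): the write amplification factor is the physical NAND bytes written divided by the bytes the host actually asked to write, so a 4kB host write that forces a whole 512kB erase block to be relocated and rewritten amplifies by a factor of 128.

     /* Hypothetical figures, only to illustrate the write amplification
      * arithmetic: physical NAND bytes written / host bytes written. */
     #include <stdio.h>

     int main(void)
     {
         double host_bytes = 4.0 * 1024;      /* a single 4kB host write            */
         double nand_bytes = 512.0 * 1024;    /* worst case: a whole 512kB erase    */
                                              /* block relocated and rewritten      */
         printf("write amplification factor = %.0f\n", nand_bytes / host_bytes); /* 128 */
         return 0;
     }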

    Comment by jkanagaraj — September 22, 2011 @ 10:27 pm BST Sep 22,2011 | Reply

  5. Jonathan,

    any updates to this issue? It is extremely interesting, at least to me. I have several questions. How is the SSD drive attached to the system? Is it some kind of internal disk (plugged directly into the server) or is it attached to a disk array? Is it really a performance issue, or just some bug in the Oracle Wait Interface?
    Maybe you should find your old C program for testing IO of that SSD disk :) I would like to note one extra point: I consider your single block reads from SSD too slow, too!

    I have run a C program written by my colleague Vlado on our AIX P6 4200MHz, on a 64kB striped FS with 5 LUNs attached (2Gb fibres) to an IBM DS 8300. The program was reading only the first 100 8kB blocks (to take advantage of the STORAGE CACHE) in RANDOM ORDER:

    ./iotool RR /u101/oracle/CTPB/babopom 8 100 5 5 
    
    #(RANDOM READS 8kB first 100 block of file for 5 seconds)
    
    
      IO [ms]           Count   Sum time [ms]   Avg time [ms]
    --------- --------------- --------------- ---------------
    <       1           18977         4940.47            0.26
    <       2               8            9.43            1.18
    <       4               0            0.00            0.00
    <       8               0            0.00            0.00
    <      16               0            0.00            0.00
    <      32               0            0.00            0.00
    <      64               0            0.00            0.00
    <     128               0            0.00            0.00
    <     256               0            0.00            0.00
    <     512               0            0.00            0.00
    <    1024               0            0.00            0.00
    <    2048               0            0.00            0.00
    <    4096               0            0.00            0.00
    <    8192               0            0.00            0.00
    <   16384               0            0.00            0.00
    <   32768               0            0.00            0.00
    

    Comment by Pavol Babel — October 2, 2011 @ 12:27 pm BST Oct 2,2011 | Reply

    • Unfortunately, I forgot to format the text. But I hope it is readable and we can clearly see 18,985 random reads, of which only 8 (about 0.04%) took over 1ms, and the average is about 0.2ms. It seems that 0.2ms is mainly the latency of the SAN infrastructure (SAN switch).
      I think the SSD disk should be at least as fast as our test.
      I should also mention that my colleague’s program opens the file with O_DSYNC with the help of CIO (IBM concurrent I/O, which completely bypasses the OS filesystem cache).

      lsof:

      
      COMMAND      PID   USER   FD   TYPE     FILE-FLAG DEVICE  SIZE/OFF NODE NAME
      iotool  15343812 oracle    3r  VREG R,CIO,DSYN,LG   53,1 108765184   10 /u101/oracle/CTPB/babopom
      
      

      So I wonder why 34% of the flash cache single block reads on your customer’s site took over 1ms. And of course, the multiblock operations are a wreck. That’s why I recommend some simple tool to check the performance of the SSD disk (and you have written one somewhere in ’97).
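
      Something along these lines would do, as a minimal sketch (not Vlado’s actual iotool, and using Linux O_DIRECT as a stand-in for AIX CIO/O_DSYNC to bypass the filesystem cache): it reads 8kB blocks at random offsets within the first 100 blocks of a file for a given number of seconds and prints a power-of-two latency histogram. The file name and duration are just command-line placeholders.

      /* Minimal random-read latency tester - a sketch in the spirit of the
       * iotool run shown above, not the original program.                  */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <unistd.h>

      #define BLKSZ   8192      /* 8kB reads, as in the test above          */
      #define NBLOCKS 100       /* confine reads to the first 100 blocks    */
      #define BUCKETS 16        /* <1ms, <2ms, <4ms ... power-of-two ms     */

      static double now_ms(void)
      {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
      }

      int main(int argc, char **argv)
      {
          if (argc < 3) {
              fprintf(stderr, "usage: %s <file> <seconds>\n", argv[0]);
              return 1;
          }

          int fd = open(argv[1], O_RDONLY | O_DIRECT);   /* bypass the FS cache */
          if (fd < 0) { perror("open"); return 1; }

          void *buf;
          if (posix_memalign(&buf, 4096, BLKSZ)) { perror("posix_memalign"); return 1; }

          long   count[BUCKETS]  = {0};
          double sum_ms[BUCKETS] = {0};
          double deadline = now_ms() + atof(argv[2]) * 1000.0;

          while (now_ms() < deadline) {
              off_t  offset = (off_t)(rand() % NBLOCKS) * BLKSZ;   /* random block  */
              double t0 = now_ms();
              if (pread(fd, buf, BLKSZ, offset) != BLKSZ) { perror("pread"); break; }
              double elapsed = now_ms() - t0;

              int b = 0;                            /* find bucket: elapsed < 2^b ms */
              while (b < BUCKETS - 1 && elapsed >= (double)(1 << b))
                  b++;
              count[b]++;
              sum_ms[b] += elapsed;
          }

          printf("%9s %15s %15s %15s\n", "IO [ms]", "Count", "Sum time [ms]", "Avg time [ms]");
          for (int b = 0; b < BUCKETS; b++)
              printf("< %7d %15ld %15.2f %15.2f\n", 1 << b, count[b], sum_ms[b],
                     count[b] ? sum_ms[b] / count[b] : 0.0);

          close(fd);
          free(buf);
          return 0;
      }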

      hope you will enjoy OOW

      Regards
      Pavol

      Comment by Pavol Babel — October 2, 2011 @ 12:43 pm BST Oct 2,2011 | Reply

  6. Regarding the double hump: It’s almost as if sometimes the multiblock reads are done either as a single large read or as multiple reads issued concurrently, and at other times they are done as 64 (or whatever the multiblock read count is) reads issued one after the other. Could this be something in the SSD controller, or dependent on whether it is all on the same chip, or something like that?

    Comment by Ashley — October 4, 2011 @ 10:05 am BST Oct 4,2011 | Reply

    • Ashley,

      I think that may be the sort of thing that’s going on. It would be nice to get a little time to experiment with the kit and see if it’s a built-in effect that can’t be bypassed, or whether there’s some software configuration that would make a difference.
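
      Something like the following minimal sketch would be one way to start experimenting with that idea (purely hypothetical, and it assumes Linux O_DIRECT, which may or may not be a fair model of whatever the flash card’s driver does): time one large 512KB pread() against the same range read as 64 serial 8KB pread() calls. In practice you’d want to repeat and alternate the two patterns at different offsets to get a fair comparison.

      /* Hypothetical sketch (not something run on the system above): compare one
       * large 512KB read with the same range read as 64 serial 8KB reads, to see
       * whether the device behaves very differently in the two cases.
       * Assumes Linux O_DIRECT; the file is a placeholder command-line argument. */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>
      #include <unistd.h>

      #define BLKSZ 8192                /* one "block"                          */
      #define MBRC  64                  /* the multiblock read count to imitate */

      static double now_ms(void)
      {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
      }

      int main(int argc, char **argv)
      {
          if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

          int fd = open(argv[1], O_RDONLY | O_DIRECT);   /* bypass the FS cache */
          if (fd < 0) { perror("open"); return 1; }

          void *buf;
          if (posix_memalign(&buf, 4096, MBRC * BLKSZ)) { perror("posix_memalign"); return 1; }

          /* pattern 1: one large 512KB read */
          double t0 = now_ms();
          if (pread(fd, buf, MBRC * BLKSZ, 0) != MBRC * BLKSZ) perror("large pread");
          printf("one %dKB read       : %8.3f ms\n", MBRC * BLKSZ / 1024, now_ms() - t0);

          /* pattern 2: the same range as 64 serial 8KB reads */
          t0 = now_ms();
          for (int i = 0; i < MBRC; i++)
              if (pread(fd, buf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ) perror("small pread");
          printf("%d serial 8KB reads : %8.3f ms\n", MBRC, now_ms() - t0);

          close(fd);
          free(buf);
          return 0;
      }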

      Comment by Jonathan Lewis — October 4, 2011 @ 3:29 pm BST Oct 4,2011 | Reply

  7. Jonathan,

    When you say “plugged into your server”, what does that mean? Is this a PCI flash card or an SSD downwind of a PCI disk controller?

    P.S., Here is an idea: Never augment DRAM L1 cache with physical reads and writes to magnetic media! The whole database flash cache is a lark! Max out your DRAM first.

    Comment by kevinclosson — October 21, 2011 @ 4:27 pm BST Oct 21,2011 | Reply

    • Kevin,

      Sorry, I don’t remember. I think at the time the DBA probably said something that made me assume it was some type of flash card – but I could be wrong.

      Comment by Jonathan Lewis — November 13, 2011 @ 3:07 pm GMT Nov 13,2011 | Reply

