Oracle Scratchpad

October 23, 2019

Clustering_Factor

Filed under: CBO,Indexing,Oracle — Jonathan Lewis @ 9:56 pm BST Oct 23,2019

A few days ago I published a little note about a script I wrote some time ago to estimate the clustering_factor of an index before it had been built. At the time I pointed out that one of its limitations was that it would not handle cases where you were planning to set the table_cached_blocks preference, but a couple of days later I decided that I’d write another version of the code that would cater for the new feature – and that’s how I made an embarrassing discovery.

Having reviewed a number of notes I’ve published about the table_cached_blocks preference and its impact on the clustering_factor I’ve realised that what I’ve written has always been open to two interpretations – the one that I had in mind as I was writing, and the correct one. I made this discovery because I had written a simple SQL statement – using the match_recognize() mechanism – to do what I considered to be the appropriate calculation. After testing the query with a few sets of sample data that produced the correct results I emailed Stew Ashton (my “go-to” person for match_recognize() questions) asking if he would do a sanity check on the code because it was rather slow and I wondered if there was a better way of writing it.

His reply was roughly:

“I’ve read the notes you and Richard Foote have written about the clustering_factor and table_cached_blocks, and this isn’t doing what your description says it should.”

Then he explained what he had inferred from what I had written … and it made more sense than what I had been thinking when I wrote it. He also supplied some code to implement his interpretation – so I designed a couple of data models that would produce the wrong prediction for whichever piece of code implemented the wrong interpretation. His code gave the right answers, mine didn’t.

So here’s the difference in interpretation – the wrong one first – using 16 as a discussion value for the table_cached_blocks:

  • WRONG interpretation: As you walk through the index entries in order, remember the last 16 rowids (that’s the rowids of the rows in the table that the index entries point to) you’ve seen. If the current rowid has a block id component that doesn’t match the block id from one of the remembered 16 rowids then increment the counter for the clustering_factor.
    • The simplicity of this algorithm means you can fix a “circular” array of 16 entries and keep walking around the circle overwriting the oldest entry each time you read a new one. It’s a pity that it’s the wrong idea because there’s a simple (though massively CPU-intensive) match_recognize() strategy for implementing it – and if you were using an internal library mechanism during a proper gather_index_stats() it could be incredibly efficient. (There’s a small analytic sketch of this interpretation just after the list.)
  • RIGHT interpretation: Set up an array of 16 block ids, each with an associated “row-number”. Walk through the index in order, giving each entry a row-number as you go. Extract the block id from the current entry and search through the array for a matching block id. If you find a match then update its entry with the current row-number (so you can remember how recently you saw the block id); if you don’t find a match then replace the entry that has the smallest (i.e. greatest distance into the past) row-number with the current block id and row-number, and increment the counter for the clustering_factor.
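
To make the difference concrete, here’s a small sketch of the wrong interpretation – not the match_recognize() query I mentioned, just a handful of lag() calls – written against the t1 table and the (rand, id) column list that appear later in this note, and using a token value of 4 in place of 16 to keep it short. Each row of the inline view carries its own block id (the first 15 characters of its rowid) together with the block ids of the previous 4 rowids in index order, and a row is counted only if its block id doesn’t match any of those four:

select
        count(*)        predicted_cf
from    (
        select
                substr(rowid,1,15)                                              blk,
                lag(substr(rowid,1,15),1) over (order by rand, id, rowid)       b1,
                lag(substr(rowid,1,15),2) over (order by rand, id, rowid)       b2,
                lag(substr(rowid,1,15),3) over (order by rand, id, rowid)       b3,
                lag(substr(rowid,1,15),4) over (order by rand, id, rowid)       b4
        from
                t1
        where
                rand is not null
        )
where
        blk not in (nvl(b1,'x'), nvl(b2,'x'), nvl(b3,'x'), nvl(b4,'x'))
;

The right interpretation can’t be written this simply because the set of remembered block ids changes only on a miss, which is why the function below maintains an array as it walks through the data.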

The first piece of code that Stew Ashton sent me was an anonymous PL/SQL block that included some hard-coded fragments and embedded SQL to use a test table and index that I had defined, but he then sent a second piece of code that creates a generic function that uses dynamic SQL to construct a query against a table and an index definition that you want to test. The latter is the code I’ve published (with permission) below:


create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.

Input is the table name, the list of column names
and the intended value of TABLE_CACHED_BLOCKS.

The function collects the last N block ids (not the last N entries).
When there is no more room, it increments the clustering factor
and replaces the least recently used block id with the current one.

Note: here a "block id" is a rowid with the row_number portion set to 0.
It is effectively a "truncated" rowid.
*/
  p_table_name in varchar2,
  p_column_list in varchar2,
  p_table_cached_blocks in number
) return number authid current_user is

  rc sys_refcursor;
  type tt_rids is table of rowid;
  lt_rids tt_rids;
  
  type t_block_list is record(
    rid rowid,
    last_hit number
  );

  type tt_block_list is table of t_block_list;
  lt_block_list tt_block_list := new tt_block_list();

  l_rid rowid;
  l_clustering_factor number := 0;
  b_block_found boolean;
  l_rn number := 0;
  l_oldest_hit number;
  i_oldest_hit binary_integer := 0;
  
  function truncated_rid(p_rid in rowid) return rowid is
    rowid_type number;
    object_number NUMBER;
    relative_fno NUMBER;
    block_number NUMBER;
    row_number NUMBER;
    rid rowid;

  begin

    DBMS_ROWID.ROWID_INFO (
      p_rid,
      rowid_type,
      object_number,
      relative_fno,
      block_number,
      row_number
    );

    rid := DBMS_ROWID.ROWID_CREATE (
      rowid_type,
      object_number,
      relative_fno,
      block_number,
      0
    );

    return rid;

  end truncated_rid;
  
begin
  if p_table_cached_blocks != trunc(p_table_cached_blocks)
  or p_table_cached_blocks not between 1 and 255 then
    raise_application_error(
      -20001, 
      'input parameter p_table_cached_blocks must be an integer between 1 and 255'
    );
  end if;

  open rc for 'select rowid from '||p_table_name||' order by '||p_column_list||', rowid';
  loop
    fetch rc bulk collect into lt_rids limit 1000;

    for irid in 1..lt_rids.count loop
      l_rn := l_rn + 1;
      l_rid := truncated_rid(lt_rids(irid));
      b_block_found := false;
      l_oldest_hit := l_rn;

      if l_rn = 1 then
        l_clustering_factor := l_clustering_factor + 1;
        lt_block_list.extend;
        lt_block_list(1).rid := l_rid;
        lt_block_list(1).last_hit := l_rn;

      else

        for i in 1..lt_block_list.count loop
          if l_oldest_hit > lt_block_list(i).last_hit then
            l_oldest_hit := lt_block_list(i).last_hit;
            i_oldest_hit := i;
          end if;
          if lt_block_list(i).rid = l_rid then
            b_block_found := true;
            lt_block_list(i).last_hit := l_rn;
            exit;
          end if;
        end loop;

        if not b_block_found then
          l_clustering_factor := l_clustering_factor + 1;
          if lt_block_list.count < p_table_cached_blocks then
            lt_block_list.extend;
            lt_block_list(lt_block_list.count).rid := l_rid;
            lt_block_list(lt_block_list.count).last_hit := l_rn; 
          else         
            lt_block_list(i_oldest_hit).rid := l_rid;
            lt_block_list(i_oldest_hit).last_hit := l_rn;
          end if;
        end if;

      end if;

    end loop;
    exit when rc%notfound;
  end loop;

  close rc;
  return l_clustering_factor;

exception when others then
  if rc%isopen then
    close rc;
  end if;
  raise;

end predict_clustering_factor;
/

After executing the above to create the function, here’s an example of usage:

rem
rem     Script:         clustering_factor_est_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2019
rem
rem     Last tested
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        cast(rownum as varchar2(10))            v1,
        trunc(dbms_random.value(0,10000))       rand,
        rpad('x',100,'x')                       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
/

-- -------------------------------------------------------------------

SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ' || predict_clustering_factor('t1','rand, id',16))
Predicted cf for t1(rand, id): 997218
Elapsed: 00:00:07.54

SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ' || predict_clustering_factor('t1','rand, id',255))
Predicted cf for t1(rand, id): 985607
Elapsed: 00:00:50.61

You’ll notice that the larger the setting for the “table_cached_blocks” parameter the more time it takes to predict the clustering_factor – and it was all CPU time in my example. This isn’t surprising given the need to search through an array holding the previous history. In this example the table t1 holds 1,000,000 rows, and the number and scatter of distinct values is so arranged that the code will hardly ever find a cached block id – essentially it’s the sort of index that isn’t going to cause much confusion to the optimizer and isn’t likely to need special attention to make the optimizer use it when it should and ignore it when it’s inappropriate.

Finally a cut-n-paste to show the accuracy of the two predictions:

SQL> create index t1_i on t1(rand, id);
Elapsed: 00:00:02.96

SQL> execute dbms_stats.set_table_prefs(null,'t1','table_cached_blocks',16)
Elapsed: 00:00:00.01

SQL> execute dbms_stats.gather_index_stats(null,'t1_i')
Elapsed: 00:00:09.55

SQL> select clustering_factor from user_indexes where index_name = 'T1_I';

CLUSTERING_FACTOR
-----------------
           997218

Elapsed: 00:00:00.11

SQL> execute dbms_stats.set_table_prefs(null,'t1','table_cached_blocks',255)
Elapsed: 00:00:00.01

SQL> execute dbms_stats.gather_index_stats(null,'t1_i')
Elapsed: 00:00:07.80

SQL> select clustering_factor from user_indexes where index_name = 'T1_I';

CLUSTERING_FACTOR
-----------------
           985607

Elapsed: 00:00:00.00

Both match perfectly – but you might notice that creating the index and gathering the stats was much faster than predicting the clustering factor for the case where we set table_cached_blocks = 255.

(If you’re wondering, my “simple but irrelevant” match_recognize() query took 370 CPU seconds to complete for table_cached_blocks = 200 – and a limit on match_recognize() meant that 200 was the maximum value I was allowed to use – so now you know why I emailed Stew Ashton (and, just for lagniappe, he also told me about a simple workaround for the 200 limit)).

13 Comments »

  1. Hello Jonathan,

    Thanks a lot for this excellent post :)

    If you don’t mind, I think that it could be very instructive for us to also see the “irrelevant” (as you call it) match_recognize() query,
    just as a usage sample of this elegant technique.
    Good lessons can very often be learned from such cases, even if not exactly solving the initial problem,
    so it would be great if you agreed to publish it as well.

    Many thanks for all your invaluable lessons & Best Regards,
    Iudith Mentzel

    Comment by Iudith Mentzel — October 24, 2019 @ 1:11 am BST Oct 24,2019 | Reply

  2. Jonathan,

    I’m wondering a bit why these scripts attempt to (re-)build/reimplement the functionality of remembering the history of blocks visited… You yourself pointed out in Cost-Based Oracle Fundamentals that Oracle internally uses the function SYS_OP_COUNTCHG to calculate the Clustering Factor when gathering index statistics with DBMS_STATS, and that function offers the “history” out of the box via a corresponding parameter. Is there a specific reason why you didn’t want to make use of that built-in function?

    A long time ago I wrote a post describing a rather similar intent – knowing the Clustering Factor before actually building the corresponding index – based on that function, and also offering the option to specify a value for the history to remember:

    https://oracle-randolf.blogspot.com/2010/01/clusteringfactor-what-if-analysis.html

    Kind regards,
    Randolf

    Comment by Randolf Geist — October 24, 2019 @ 1:04 pm BST Oct 24,2019 | Reply

    • Randolf

      The point was to get an idea of the clustering_factor without building the index – and sys_op_countchg() is relevant only for existing indexes.

      The (previous) post was also based on a draft I started a few years ago, and I did point out that nowadays it might be more efficient simply to create the index as an invisible one and see what actual numbers would appear.

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — October 24, 2019 @ 7:46 pm BST Oct 24,2019 | Reply

      • Jonathan,

        I’ve just done a quick test run on 19.3 using my very old script and test case described in the link above – and, as I remembered, the “sys_op_countchg()” function also works without any existing/matching/suitable index and gives exactly the prediction of the clustering factor that you then get when creating the index / executing DBMS_STATS after setting “table_cached_blocks” etc.

        So I don’t think that it is only relevant for existing indexes.

        Kind regards,
        Randolf

        Comment by Randolf Geist — November 5, 2019 @ 2:58 pm GMT Nov 5,2019 | Reply

        • Randolf,

          Apologies, I only took a quick glance at your post before answering your question, and I think I only got as far as point 1 on your list of reasons for using sys_op_countchg().

          To answer your original question properly: I wasn’t aware that you could use sys_op_countchg() without forcing Oracle to physically use an index; in fact I think I tried something similar while writing CBO-F and failed to get it working, and never thought to check if things had changed (which is always a cardinal error with Oracle).

          Thanks for coming back to correct me.

          Regards
          Jonathan Lewis

          Comment by Jonathan Lewis — November 7, 2019 @ 9:53 am GMT Nov 7,2019
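
If you haven’t followed the link, the shape of the what-if query Randolf describes is roughly as follows. Treat it as a sketch rather than a recipe: sys_op_countchg() is an undocumented internal function, the second parameter plays the part of table_cached_blocks (16 here), the first 15 bytes of the rowid identify the block, and the inline view needs to stay unmerged (and the query run serially) so that the rowids reach the aggregate in candidate-index order – the exact details are in Randolf’s post.

select
        sys_op_countchg(block_id, 16)   predicted_cf
from    (
        select  /*+ no_merge */
                substrb(rowid,1,15)     block_id
        from
                t1
        where
                rand is not null
        order by
                rand, id, rowid
        )
;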

  3. […] couple of days ago I posted a note with some code (from Stew Ashton) that derived the clustering_factor you would get for an index when you had set a value for the table_cached_blocks preference but had […]

    Pingback by match_recognize() | Oracle Scratchpad — October 25, 2019 @ 6:47 pm BST Oct 25,2019 | Reply

  4. Hi Jonathan,
    That’s a very fascinating post, thank you, and thanks Stew for working on this. I think it has an interesting approach, and it’s somehow satisfying to see custom PL/SQL code show exactly the same results as an Oracle internal mechanism.

    Recently I came across a somewhat similar issue, where I used a PL/SQL collection as a cache for storing results returned by a web service, thus avoiding repetitive web calls. What crossed my mind when I was reading the code of predict_clustering_factor is that we probably lose some time looping through collections, and it might be nice if, instead of scanning through all the values on each step, we could perform the search in one operation, like accessing a hash table.

    I puzzled over a bit, and here’s what I came with (attaching link to livesql to save space): https://livesql.oracle.com/apex/livesql/s/i5mddh9xvwlydnopgjjss2bul

    This version works a bit more slowly for lower table_cached_blocks values, but shows constant time across the whole range up to 255, and sometimes works a bit faster with higher values. The results totally match Stew’s original version, so I hope I didn’t screw something up :)
    I’ve run the same test you did in the post, Jonathan, and here is a short version of the results from my VM:

    predict_clustering_factor('t1','rand, id',16): 00:00:03.682
    predict_clustering_factor('t1','rand, id',255): 00:00:23.379

    predict_clustering_factor2('t1','rand, id',16): 00:00:06.666
    predict_clustering_factor2('t1','rand, id',255): 00:00:05.612

    Also, native compilation might give a bit of a boost:

    ALTER SESSION SET PLSQL_CODE_TYPE='NATIVE';
    ALTER FUNCTION predict_clustering_factor COMPILE;
    ALTER FUNCTION predict_clustering_factor2 COMPILE;

    It most significantly affects Stew’s version with 255 cached blocks, trimming the run time down to about 16 seconds on my machine. Gains for the other cases are noticeable, but much less significant.
    We still can’t match the speed of the internal mechanism, so this whole thing seems more like a fun exercise than a practical solution :)

    Still, I’d like to explain how I approached the problem – you could safely hide this part under the cut or remove it altogether, as it might get boring :)

    While there are no “true” hash tables in PL/SQL, associative arrays still allow us to access their values by index and to add values with arbitrary indexes. We need to store and access a hash value that uniquely identifies the file and block, plus a “row number” that gives an idea of how long ago a particular block was accessed.
    With this in mind, we can implement the cache using two associative arrays: one stores row_numbers and is indexed by “block hash value”, the other, vice versa, stores block hashes and is indexed by row_number. This way we can quickly find a block either by its hash or by its row_number. After that we define three procedures that maintain the cache:
    – add_to_cache(block_hash, rownum) – adds a value to both arrays;
    – delete_oldest() – finds the block hash with the lowest index in the array indexed by rownum, deletes that entry from the array indexed by block hash, then deletes it from the rownum-indexed array as well;
    – update_touch(block_hash, rownum) – if a value was found in the cache, this procedure finds the old row_num by block_hash in the first array, deletes that entry from the second array, reinserts it with the new row_num, then updates the row_num in the first array. (I’m not sure update_touch is a good name here, but it somehow seems explanatory :))
    All that’s left is to compute a block hash, which can be done by getting the relative file number and block number from the rowid and combining them using Cantor pairing, since both are positive integers.
    I’m walking on thin ice using pls_integers everywhere, because they might overflow, and the fetch limit is rather high, but this shouldn’t be a big problem for a modern PC, I hope )
    Thank you for reading the whole thing :)

    Comment by Viacheslav Andzhich — October 26, 2019 @ 8:39 pm BST Oct 26,2019 | Reply

    • Viacheslav,

      Thanks for this. My “live” code is now here: https://stewashton.wordpress.com/2019/10/26/predict_clustering_factor/

      I thought of associative arrays too, but I don’t see how they fit with my new objective, which is to find the clustering factors for all possible TABLE_CACHED_BLOCKS settings at the same time. Using PLSQL_CODE_TYPE='NATIVE', thanks to you, I now get all the clustering factors in about 18 seconds on my laptop. That is faster than gathering statistics on the index 255 times :-)

      Best regards,
      Stew

      Comment by stewashton — October 28, 2019 @ 11:19 am GMT Oct 28,2019 | Reply

    • Viacheslav,

      Thank you for writing this, posting the code to livesql and then explaining the logic. It’s a lovely example of the things you can do with pl/sql and associative arrays – the symmetrical pair of arrays is particularly elegant. I have to admit I had not come across the Cantor pairing concept before, so that was an extra little treat.

      Comment by Jonathan Lewis — October 31, 2019 @ 12:37 pm GMT Oct 31,2019 | Reply
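
As a footnote to this exchange, here’s a rough reconstruction of the two-array cache that Viacheslav describes – not his LiveSQL code, just a sketch with illustrative names (predict_cf_hashed, block_hash and so on) showing how a pair of associative arrays can replace the linear scan through the history array:

create or replace function predict_cf_hashed(
        p_table_name            in varchar2,
        p_column_list           in varchar2,
        p_table_cached_blocks   in number
) return number authid current_user is

        rc              sys_refcursor;
        type tt_rids    is table of rowid;
        lt_rids         tt_rids;

        -- block hash -> row-number of the most recent hit
        type tt_rn_by_block is table of pls_integer index by pls_integer;
        -- row-number of the most recent hit -> block hash
        type tt_block_by_rn is table of pls_integer index by pls_integer;

        rn_by_block     tt_rn_by_block;
        block_by_rn     tt_block_by_rn;

        l_hash          pls_integer;
        l_old_rn        pls_integer;
        l_rn            pls_integer := 0;
        l_cf            number      := 0;

        function block_hash(p_rid in rowid) return pls_integer is
                l_fno   number := dbms_rowid.rowid_relative_fno(p_rid);
                l_blk   number := dbms_rowid.rowid_block_number(p_rid);
        begin
                -- Cantor pairing of (file#, block#): a unique non-negative integer per block.
                -- Computed in number, but (as noted above) the result can still overflow
                -- pls_integer for very large block numbers.
                return (l_fno + l_blk) * (l_fno + l_blk + 1) / 2 + l_blk;
        end block_hash;

begin
        open rc for 'select rowid from ' || p_table_name ||
                    ' order by ' || p_column_list || ', rowid';
        loop
                fetch rc bulk collect into lt_rids limit 1000;

                for i in 1 .. lt_rids.count loop
                        l_rn   := l_rn + 1;
                        l_hash := block_hash(lt_rids(i));

                        if rn_by_block.exists(l_hash) then
                                -- cache hit: refresh the "last seen" row-number in both arrays
                                l_old_rn := rn_by_block(l_hash);
                                block_by_rn.delete(l_old_rn);
                                block_by_rn(l_rn)   := l_hash;
                                rn_by_block(l_hash) := l_rn;
                        else
                                -- cache miss: count it, add it, and evict the oldest entry if full
                                l_cf := l_cf + 1;
                                rn_by_block(l_hash) := l_rn;
                                block_by_rn(l_rn)   := l_hash;

                                if rn_by_block.count > p_table_cached_blocks then
                                        l_old_rn := block_by_rn.first;
                                        rn_by_block.delete(block_by_rn(l_old_rn));
                                        block_by_rn.delete(l_old_rn);
                                end if;
                        end if;
                end loop;

                exit when rc%notfound;
        end loop;
        close rc;

        return l_cf;
end predict_cf_hashed;
/

The second array exists so that block_by_rn.first returns the smallest (i.e. oldest) row-number in a single call, which removes the need to scan the whole cache to find the entry to evict.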

  5. […] Lewis recently wrote about estimating the clustering factor of an index, taking into account the intended value of the TABLE_CACHED_BLOCKS parameter of […]

    Pingback by Predict_Clustering_Factor | An Oracle Programmer — October 27, 2019 @ 6:22 am GMT Oct 27,2019 | Reply

  6. Hello Jonathan,

    Just for fun, I tried to create a SELECT for the “wrong interpretation” of the CF calculation,
    based on analytics.

    The following one could have been the “more elegant” one, but, unfortunately, the COLLECT function
    does not have an analytic form …

    CREATE OR REPLACE TYPE file_block_nt AS TABLE OF VARCHAR2(9)
    /
    
    select rand,
           rid,
           file_block_id
    from (
      select rand, 
             rid,
             file_block_id,
             CAST( COLLECT ( file_block_id )  OVER( ORDER BY rand, rid 
                                                    ROWS BETWEEN 32 PRECEDING AND 1 PRECEDING )  
                   AS file_block_nt )  prev_file_block
      FROM (
                    select
                            rand, 
                            rowid  rid,
                            cast(substr(rowid,7,9) as varchar2(9))  file_block_id
                    from
                            t1
                    where
                            rand is not null
           )
    )
    where file_block_id MEMBER OF prev_file_block
    /
    
    

    Interestingly enough, trying to run this in LiveSQL produces an internal error, instead of the (expected) syntax error.

    The other variant is a work-around to the same idea:

    
    WITH fb AS (
         select
                rand, 
                rowid  rid,
                substr(rowid,7,9)  file_block_id
         from
                t1
         where
                rand is not null
    )
    select rand,
           rid,
           file_block_id
    from (
      select rand, 
             rid,
             file_block_id,
             CAST( MULTISET (
                        SELECT fb2.file_block_id 
                        FROM fb  fb2
                        WHERE ( fb2.rand < fb.rand  OR                    
                               fb2.rand = fb.rand AND fb2.rid < fb.rid )  
                        ORDER BY fb2.rand DESC, fb2.rid DESC
                        FETCH FIRST 32 ROWS ONLY
                   )  AS file_block_nt )  prev_file_block
      from fb  
      )
    where file_block_id MEMBER OF prev_file_block
    /
    
    

    This is less elegant than the one above, because it needs to set the “starting” point of the MULTISET query,
    for “looking backward” from the current row only (in addition to the descending order).

    But it works, though probably not the most efficient thing to do.

    As an exercise, it was however useful.

    Maybe supporting the COLLECT function also as an analytic aggregate function could be a useful enhancement request
    to Oracle.

    Thanks a lot & Best Regards,
    Iudith Mentzel

    Comment by Iudith Mentzel — October 28, 2019 @ 11:30 am GMT Oct 28,2019 | Reply

    • Iudith,

      Thanks for the examples – the COLLECT() option looks really elegant, so it’s a great shame it’s not supported (yet – as you indicate in the comment on match_recognize()).

      I think I’ve edited the corrected version of the code back into this one properly – but I inserted the “sourcecode” tags (in square brackets) very shortly after you wrote the comment without actually checking what was there and whether I had destroyed anything by accident. Probably best to send me an email if it needs further correction.

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — October 31, 2019 @ 1:49 pm GMT Oct 31,2019 | Reply

