Oracle Scratchpad

October 25, 2019

match_recognize()

Filed under: Match_recognize,Oracle — Jonathan Lewis @ 6:47 pm BST Oct 25,2019

A couple of days ago I posted a note with some code (from Stew Ashton) that derived the clustering_factor you would get for an index when you had set a value for the table_cached_blocks preference but had not yet created the index. In the note I said that I had produced a simple and elegant (though massively CPU-intensive) query using match_recognize() that seemed to produce the right answer, only to be informed by Stew that my interpretation of how Oracle calculated the clustering_factor was wrong and that the query was irrelevant.  (The fact that it happened to produce the right answer in all my tests was an accidental side effect of the way I had been generating test data. With Stew’s explanation of what Oracle was doing it was easy to construct a data set that proved my query was doing the wrong thing.)

The first comment I got on the posting was a request to publish my match_recognize() query – even though it was irrelevant to the problem in hand – simply because it might be a useful lesson in what could be done with the feature; so here’s some code that doesn’t do anything useful but does demonstrate a couple of points about match_recognize(). I have to stress, though, that I’m not an expert on match_recognize() so there may be a more efficient way of using it to achieve the same result. There is certainly a simpler and more efficient way to get the same result using some fairly straightforward PL/SQL code.

Requirement

I had assumed that if you set the table_cached_blocks preference to N then Oracle would keep track of the previous N rowids as it walked an index to gather stats and only increment the “clustering factor counter” if it failed to find a match for the block address of the current rowid in the block addresses extracted from the previous N rowids. My target was to emulate a way of doing this counting.
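For a sense of what that model means procedurally, here’s a minimal PL/SQL sketch of the counting I had assumed – this is a hypothetical illustration (not the PL/SQL solution Stew sent me), and it assumes the table t1 and column rand that I create below:

set serveroutput on

declare
        c_depth constant pls_integer := 32;             -- emulated table_cached_blocks

        type addr_list is varray(32) of varchar2(30);
        recent  addr_list   := addr_list();
        pos     pls_integer := 0;
        clf     number      := 0;
        found   boolean;
begin
        for r in (
                select
                        dbms_rowid.rowid_relative_fno(rowid) || '.' ||
                        dbms_rowid.rowid_block_number(rowid)    block_addr
                from    t1
                where   rand is not null
                order by
                        rand, rowid
        ) loop
                -- increment the counter only if the current block address
                -- is not among the previous 32 block addresses
                found := false;
                for i in 1 .. recent.count loop
                        if recent(i) = r.block_addr then
                                found := true;
                                exit;
                        end if;
                end loop;

                if not found then
                        clf := clf + 1;
                end if;

                -- treat the varray as a circular buffer of the last 32 addresses
                pos := mod(pos, c_depth) + 1;
                if recent.count < pos then
                        recent.extend;
                end if;
                recent(pos) := r.block_addr;
        end loop;

        dbms_output.put_line('Estimated clustering_factor: ' || clf);
end;
/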

Strategy

Rather than writing code that “looked backwards” as it walked the index, I decided to take advantage of the symmetry of the situation and write code that looked forwards (you could think of this as viewing the index in descending order and looking backwards along the descending index). I also decided that I could count the number of times I did find a match in the trail of rowids, and subtract that from the total number of index entries.

So my target was to look for patterns where I start with the block address from the current rowid, then find the same block address after zero to N-1 failures to find the block address. Since the index doesn’t exist I’ll need to emulate its existence by selecting the columns that I want in the index along with the rowid, ordering the data in index order. Then I can walk this “in memory” index looking for the desired pattern.

Here’s some code to create a table to test against:


rem
rem     Script:         clustering_factor_est_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem 

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        cast(rownum as varchar2(10))            v1,
        trunc(dbms_random.value(0,10000))       rand,
        rpad('x',100,'x')                       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
/

My table has 1M rows, and there’s a column called rand which has 10,000 distinct values. This is generated through Oracle’s dbms_random package and the procedure I’ve used will give me roughly 100 occurrences for each value scattered uniformly across the table. An index on this column might be quite useful but it’s probably going to have a very high clustering_factor because, on average, any two rows of a particular value are likely to be separated by 10,000 rows, and even if I look for rows with a pair of consecutive values any two rows are likely to be separated by a few hundred other rows.
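As a quick sanity check of that claim about the distribution (not part of the original script) you could aggregate the aggregates:

select
        count(*)                distinct_values,
        min(ct)                 min_rows,
        max(ct)                 max_rows,
        round(avg(ct),1)        avg_rows
from    (
        select rand, count(*) ct
        from   t1
        group by rand
        )
;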

Here’s the code that I was using to get an estimate of (my erroneous concept of) the clustering_factor with table_cached_blocks set to 32. For debugging purposes it reports the first row of the pattern each time my pattern is matched, so in this version of the code I’d have to check the number of rows returned and subtract that from the number of rows in the table (where rand is not null).


select
        first_rand, first_file_id, first_block_id, first_row_id, step_size
from    (
                select
                        rand, 
                        dbms_rowid.rowid_relative_fno(rowid) file_id,
                        dbms_rowid.rowid_block_number(rowid) block_id,
                        dbms_rowid.rowid_row_number(rowid)   row_id
                from
                        t1
                where
                        rand is not null
        )
match_recognize(
        order by rand, file_id, block_id, row_id
        measures
                count(*) as step_size,
                first(strt.rand)        as first_rand,
                first(strt.file_id)     as first_file_id,
                first(strt.block_id)    as first_block_id,
                first(strt.row_id)      as first_row_id
        one row per match
        after match skip to next row
        pattern (strt miss{0, 31} hit)
        define 
                miss as (block_id != strt.block_id or  file_id != strt.file_id),
                hit  as (block_id  = strt.block_id and file_id  = strt.file_id)
)
order by 
        first_rand, first_file_id, first_block_id, first_row_id
;

The first point to note is that the inline view is the thing I use to model an index on column rand. It simply selects the value and rowid from each row. However I’ve used the dbms_rowid package to break the rowid down into the file_id, block_id and row_id within block. Technically (to deal with partitioned tables and indexes) I also ought to be thinking about the object_id which is the first component of the full rowid. I haven’t actually ordered the rows in the in-line view as I’m going to let the match_recognize() operation handle that.
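If you want to see the decomposition in isolation (including the data object id I’ve just mentioned) you could run a quick throwaway query like this – it’s not part of the original script:

select
        rowid                                   rid,
        dbms_rowid.rowid_object(rowid)          data_object_id,
        dbms_rowid.rowid_relative_fno(rowid)    file_id,
        dbms_rowid.rowid_block_number(rowid)    block_id,
        dbms_rowid.rowid_row_number(rowid)      row_id
from
        t1
where
        rownum <= 3
;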

A warning as we start looking at the match_recognize() section of the query: I’ll be jumping around the clause a little bit, rather than working through it in order from start to finish.

First I need the raw data sorted in order so that any results I get will be deterministic. I have an order by clause that sorts my data by rand and the three components I’ve extracted from the rowid. (It is also possible to have a partition clause – similar to the partition clause of an analytic function – but I don’t need one for this model.)

Then I need to define the “columns” that I’m planning to output in the final query. This is the set of measures, and in the list of measures you can see a count(*) (which doesn’t quite mean what it usually does) and a number of calls to a function first() which takes an aliased column name as its input and, although the column names look familiar, the alias (strt) seems to have sprung from nowhere.

At this point I want to jump down to the define clause because that’s where the meaning of strt could have been defined.  The define clause is a list of “rules”, which are also treated as “variables”, that tell us how to classify a row. We are looking for patterns, and a pattern is a set of rows that follows a set of rules in a particular order, so one of the things I need to do is create a rule that tells Oracle how to identify a row that qualifies as the starting point for a pattern – so I’ve defined a rule called strt to do this – except I haven’t defined it explicitly: it’s not visible in the define list, so Oracle has assumed that the rule is “1 = 1”, in other words every row I’m going to look at could be classified as a strt row.
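Incidentally, if you prefer to make the default explicit you can (I believe) include strt in the define list with a predicate that is always true, making the clause logically equivalent to the one above:

        define
                strt as (1 = 1),
                miss as (block_id != strt.block_id or  file_id != strt.file_id),
                hit  as (block_id  = strt.block_id and file_id  = strt.file_id)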

Now that I have a definition for what strt means I could go back to the measures – but I’ll postpone doing that for a moment and look at the other rules in my define list. I have a rule called miss which says “if either of these comparisons evaluates to true” then the row is a miss; but the predicate includes a reference to strt, which means we are doing comparisons with the most recent row that was classified as a strt row. So a miss means we’ve found a starting row and we’re now looking at other rows, comparing the block_id and file_id of each row to check that they don’t match the block_id and file_id of the starting row.

Similarly we have a hit rule which says a hit means we’ve previously found a starting row and we’re now looking at other rows checking for rows where the current block_id and file_id match the starting block_id and file_id.

Once we’ve got a set of rules explaining how to classify rows we can specify a pattern (which means going back up the match_recognize() clause one section). Our pattern reads: “find a strt row, followed by between zero and 31 miss rows, followed by a hit row”. And that’s just a description of my back-to-front way of saying “remember the last 32 rowids and check if the current block address matches the block address in one of those rowids”.

The last two clauses that need to be explained before I revisit the measures clause are the “one row per match” and “after match skip to next row”.

If I find a sequence of rows that matches my pattern there are several things I could do with that set of rows – for example I could report every row in that pattern along with the classification of strt/miss/hit (which would be useful if I’m looking for a few matches to a small pattern in a large data set), or (as I’ve done here) I could report just one row from the pattern and give Oracle some instruction about which one row I want reported.
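For reference, the “report every row” version would replace the measures and “one row per match” lines with something like the following (a sketch I haven’t timed for this article, using the built-in classifier() and match_number() functions that report the rule name and the match number for each row):

        measures
                classifier()    as row_type,
                match_number()  as match_no
        all rows per match
        after match skip to next row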

Then, after I’ve found (and possibly reported) a pattern, what should I do next? Again there are several possibilities – the two most obvious ones, perhaps, are: “just keep going” i.e. look at the row after the end of the pattern to see if it’s another strt row, and “go to the row after the start of the pattern you’ve just reported”. These translate into: “after match skip past last row” (which is the default if you don’t specify an “after match” clause) and “after match skip to next row” (which is what I’ve specified).

Finally we get back to the measures clause – I’ve defined four “columns” with names like ‘first_xxx’ and a step_size. The step_size is defined as count(*) which – in this context – means “count the number of rows in the current matched pattern”. The other measures are defined using the first() function, referencing the strt alias, which tells Oracle I want to retain the value from the first row that met the strt rule in the current matched pattern.

In summary, then, my match_recognize() clause tells Oracle to:

  • Sort the data by rand, file_id, block_id, row_id
  • For each row in turn
    • extract the file_id and block_id
    • take up to 32 steps down the list looking for a matching file_id and block_id
    • If you find a match pass a row to the parent operation that consists of: the number of rows from strt to hit inclusive, and the values of rand, file_id, block_id, and row_id of the strt row.

Before you try testing the code, you might like to know about some results.

As it stands, on my laptop with a virtual machine running 12.2.0.1, the query took 1 minute and 5 seconds to complete with “miss{0, 31}” in the pattern. In fact the first version of the code had the file_id and block_id tests in the define clause in the opposite order, viz:

        define 
                miss as (file_id != strt.file_id or  block_id != strt.block_id),
                hit  as (file_id  = strt.file_id and block_id  = strt.block_id)

Since the file_id for my test is the same for every row in my table (it’s a single file tablespace), this was wasting a surprising amount of CPU, leading to a run time of 1 minute 17 seconds! Neither time looks particularly good when compared to the time required to create the index, set the table_cached_blocks to 32, and gather stats on the index – in a total time of less than 5 seconds. The larger the value I chose for the pattern match, the worse the workload became, until at “miss{0, 199}” – emulating a table_cached_blocks of 200 – the time to run was about 434 seconds of CPU!

A major portion of the problem is the way that Oracle is over-optimistic (or “greedy”, to use the technical term) with its pattern matching, combined with the nature of the data which (in this example) isn’t going to offer much opportunity for matching, combined with the fact that Oracle cannot determine that a row that is not a “miss” has to be a “hit”. In this context “greedy” means Oracle will try to find as many consecutive occurrences of the first rule in the pattern as possible before it tries to find an occurrence of the second rule – and when it fails to match a pattern it will “backtrack” one step and have another go, being slightly less greedy. So, for our example, the greedy algorithm will operate as follows:

  • find 31 rows that match miss, then discover the 32nd row does not match hit
  • go back to the strt and find 30 rows that match miss, then discover the 31st row does not match hit
  • go back to the strt and find 29 rows that match miss, then discover the 30th row does not match hit
  • … repeat until
  • go back to the strt and find 0 rows that match miss, then discover the 1st does not match hit
  • go to the next row, call it strt, and repeat the above … 1M times.

From a human perspective this is a pretty stupid strategy for this specific problem – but that’s because we happen to know that “hit” = “not(miss)” (ignoring nulls, of which there are none) while Oracle has to assume that there is no relationship between “hit” and “miss”.

There is hope, though, because you can tell Oracle that it should be “reluctant” rather than “greedy” which means Oracle will consume the smallest possible number of occurrences of the first rule before testing the second rule, and so on. All you have to do is append a “?” to the count qualifier:

        pattern (strt miss{0, 31}? hit)

Unfortunately this seemed to have very little effect on execution time (and CPU usage) in our case. Again this may be due to the nature of the data etc., but it may also be a limitation in the way that the backtracking works. I had expected a response time that would more closely mirror the human expectation, but a few timed tests suggest the code uses exactly the same type of strategy for the reluctant option as it does for the greedy one, viz:

  • find 0 rows that match miss, then discover the next row does not match hit
  • go back to the strt and find 1 row that matches miss, then discover the 2nd row does not match hit
  • go back to the strt and find 2 rows that match miss, then discover the 3rd row does not match hit
  • … repeat until
  • go back to the strt and find 31 rows that match miss, then discover the 32nd does not match hit
  • go to the next row, call it strt, and repeat the above … 1M times.

Since there are relatively few cases in our data where a pattern match will occur both the reluctant and the greedy strategies will usually end up doing all 32 steps. I was hoping for a more human-like algorithm in which Oracle would recognise that if it has just missed on the first X rows then it need only check the (X+1)th and not go back to the beginning (strt) – but my requirement makes it easy to see (from a human perspective) that that makes sense; in a generic case (with more complex patterns and without the benefit of having two mutually exclusive rules) the strategy of “start all over again” is probably a much safer option to code.

Plan B

Plan B was to send the code to Stew Ashton and ask if he had any thoughts on making it more efficient – I wasn’t expecting him to send a PL/SQL solution by return of post, but that’s what I got and published in the previous post.

Plan C

It occurred to me that I don’t really mind if the predicted clustering_factor is a little inaccurate, and that the “backtracking” introduced by the variable qualifier {0,31} was the source of a huge amount of the work done, so I took a different approach which simply said: “check that the next 32 (or preferred value of N) rows are all misses”. This required two changes – eliminate one of the defines (the “hit”) and modify the pattern definition as follows:

         pattern (strt  miss{32} )
         define
                 miss as (block_id != strt.block_id or  file_id != strt.file_id)

The change in strategy means that the result is going to be (my version of) the clustering_factor rather than the number to subtract from num_rows to derive the clustering_factor. And I’ve introduced a small error which shows up towards the end of the data set – I’ve demanded that a pattern should include exactly 32 misses; but when you’re 32 rows from the end of the data set there aren’t enough rows left to match the pattern. So the result produced by the modified code could be as much as 32 short of the expected result. However, when I showed the code to Stew Ashton he pointed out that I could include “alternatives” in the pattern, so all I had to do was add to the pattern something which said “and if there aren’t 32 rows left, getting to the end of the data set is good enough” (technically that should be the end of the current partition, but we have only one partition).

         pattern (strt  ( miss{32} | ( miss* $) ) )

The “miss” part of the pattern now reads:  “32 misses” or “zero or more misses and then the end of file/partition/dataset”.

It’s still not great – but the time to process the 1M rows with a table_cached_blocks of 32 came down to 31 seconds.
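Putting the pieces together, the full Plan C query would look something like the following – this is my reconstruction (wrapped in a simple count(*), as in the SQL Monitor tests below), so treat it as a sketch rather than the exact text I ran:

select  count(*)        est_clustering_factor
from    (
        select
                rand,
                dbms_rowid.rowid_relative_fno(rowid) file_id,
                dbms_rowid.rowid_block_number(rowid) block_id,
                dbms_rowid.rowid_row_number(rowid)   row_id
        from
                t1
        where
                rand is not null
        )
match_recognize(
        order by rand, file_id, block_id, row_id
        measures
                first(strt.block_id)    as first_block_id
        one row per match
        after match skip to next row
        pattern (strt ( miss{32} | ( miss* $) ) )
        define
                miss as (block_id != strt.block_id or  file_id != strt.file_id)
)
;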

Finally

I’ll close with one important thought. There’s a significant difference in the execution plans for the two strategies – which I’m showing as outputs from the SQL Monitor report using a version of the code that does a simple count(*) rather than listing any rows:


pattern (strt miss{0, 31} hit)
==========================================================================================================================================
| Id |         Operation          | Name  |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                            |       | (Estim) |       | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==========================================================================================================================================
|  0 | SELECT STATEMENT           |       |         |       |        54 |    +14 |     1 |        1 |     . |          |                 |
|  1 |   SORT AGGREGATE           |       |       1 |       |        54 |    +14 |     1 |        1 |     . |          |                 |
|  2 |    VIEW                    |       |      1M | 12147 |        54 |    +14 |     1 |     2782 |     . |          |                 |
|  3 |     MATCH RECOGNIZE SORT   |       |      1M | 12147 |        60 |     +8 |     1 |     2782 |  34MB |          |                 |
|  4 |      VIEW                  |       |      1M |   325 |        14 |     +1 |     1 |       1M |     . |          |                 |
|  5 |       INDEX FAST FULL SCAN | T1_I1 |      1M |   325 |         7 |     +8 |     1 |       1M |     . |          |                 |
==========================================================================================================================================

pattern (strt  miss{32} )
==================================================================================================================================================================
| Id |                     Operation                      | Name  |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                                                    |       | (Estim) |       | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================================
|  0 | SELECT STATEMENT                                   |       |         |       |        15 |    +16 |     1 |        1 |     . |          |                 |
|  1 |   SORT AGGREGATE                                   |       |       1 |       |        15 |    +16 |     1 |        1 |     . |          |                 |
|  2 |    VIEW                                            |       |      1M | 12147 |        15 |    +16 |     1 |     997K |     . |          |                 |
|  3 |     MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO |       |      1M | 12147 |        29 |     +2 |     1 |     997K |  34MB |          |                 |
|  4 |      VIEW                                          |       |      1M |   325 |        16 |     +1 |     1 |       1M |     . |          |                 |
|  5 |       INDEX FAST FULL SCAN                         | T1_I1 |      1M |   325 |         9 |     +8 |     1 |       1M |     . |          |                 |
==================================================================================================================================================================

The significant line to note is operation 3 in both cases. The query with the pattern that’s going to induce back-tracking reports only “MATCH RECOGNIZE SORT”. The query with the “fixed” pattern reports “MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO”, which tells us that Oracle can implement the match as a finite state machine with a fixed worst-case run-time. When you write some code that uses match_recognize() the three magic words you want to see in the plan are “deterministic finite auto” – if you don’t then, in principle, your query might be one of those that could (theoretically) run forever.
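If you want to check which variant you’ve got without firing up SQL Monitor, pulling the last plan from memory is enough – this is the standard dbms_xplan call, nothing specific to match_recognize():

set serveroutput off

-- run the match_recognize() query here, then:

select * from table(dbms_xplan.display_cursor(format => 'BASIC'));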

Addendum

Congratulations if you’ve got this far – but please remember that I’ve had very little practice using match_recognize(); this was a fun little learning experience for me and there may be many more things you could usefully know about the technology before you use it in production; there may also be details in what I’ve written which are best forgotten about. That being the case I’d be more than happy for anyone who wants to enhance or correct my code, descriptions and observations to comment below.

 

March 6, 2018

Match_recognise – 2

Filed under: 12c,Match_recognize,Oracle,Tuning — Jonathan Lewis @ 7:59 am GMT Mar 6,2018

In my previous post I presented a warning about the potential cost of sorting and the cost of failing to find a match after each pass of a long search. In a comment on that post Stew Ashton reminded me that the cost of repeatedly trying to find a match starting from “the next row down” could be less of a threat than the cost of “back-tracking” before moving to the next row down.

Taking the example from the previous posting to explain – the requirement was for customers who had executed a transaction in September but not October, and a match_recognize() clause suggested on the ODC (formerly OTN) database forum to implement this requirement was as follows:

match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding
        pattern(x+ y* $) 
        define
                x as mth  = 9,
                y as mth != 10
);

In the blog post I raised the issue of an extreme case where there were 100,000 transactions for a single customer, of which all but the last were September transactions and the last was an October transaction. This would have two effects – first that we would have to sort 100,000 rows, including the cust_id that appeared in the “partition by” clause and the 1000-character padding column that was listed in the measures clause, leading to a huge dump to, and constant re-read of, the temporary tablespace; secondly that having failed to find a match starting from row 1 Oracle would go back to row 2 and try again, then to row 3, and so on.

The point that Stew Ashton made was that Oracle doesn’t just “go back to” row 2, it will be unwinding a stack, or reversing out a recursive descent, to get there. What this means is that Oracle will fail as it reaches the October at row 100,000 and say “no more X rows, is this a Y row? no”, backtrack to row 99,998 and say “what if I stop collecting X rows here and start looking for Y rows?”, so it reads row 99,999 as a Y row (since 9 != 10), then finds row 100,000 and fails the match. So it backtracks again to row 99,997 and says “what if I stop collecting X rows here and start looking for Y rows?”, and this time it identifies rows 99,998 and 99,999 as Y rows, then fails on row 100,000.

Remember, this is still part of the attempt to match the pattern starting at row 1 – and there are 99,996 more steps backwards still to go, and the further we go back the further we come forward again until we fail – so there are 99,998 steps we have to back-track through before we even start to consider a pattern starting at row 2.

To demonstrate the costs I’ve got three variants of the original query. First, the query as it was but limited to just 1,000 rows for a single customer; second a changed pattern that highlights the cost of trying to use back-tracking to match the pattern just once, starting from row 1 (the pattern doesn’t actually meet the original requirement because it would only find customers whose first transaction of the year was in September); finally a changed pattern that achieves the required result much more efficiently than the original (but still very slowly) by adding some human intelligence to the implementation.

Here’s version 1 – which took 257 CPU seconds to handle just 1,000 rows:

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);

You’ll see that I’ve included the “debug” functions of classifier() and match_number() in the SQL above – these are particularly useful with the options “all rows per match” and “with unmatched rows” when you’re trying to figure out why your match_recognize() clause is not producing the right results, so I’ve left them there purely for reference.
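To put those debug options to work the skeleton would look something like this (my sketch – with “all rows per match with unmatched rows” every input row appears in the output, and classifier() is null for rows that are not part of any match):

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                classifier()    as cl,
                match_number()  as mn
        all rows per match with unmatched rows
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);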

Then there’s a version where I’ve made the modification suggested by Stew Ashton to demonstrate the full cost of an attempt to match only if the pattern starts on the first row of the partition. This took just 0.83 CPU seconds to complete. This may sound fairly reasonable, but if you compare that to the time it might take simply to sort and walk once through 1,000 rows you’ll realise that it’s actually pretty expensive – and it’s not surprising that when we had to do the same thing 1,000 times (on a slowly decreasing set, of course, as we work our way down the partition) the original task took 257 CPU seconds.

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(^ x+ y* $)
        define
                x as mth  = 9,
                y as mth != 10
);

You’ll notice the caret “^” at the start of the pattern – this means the pattern must start at the first row of the partition (just as the “$” means the pattern has to end at the end of the partition).

Finally, thinking of a better way of using match_recognize() for this requirement we realise that we know that November comes after October, and December comes after November so (in the context of our example) the predicate “!= 10” is equivalent to “> 10”. With this code change the original query took 0.82 CPU seconds.


select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x+ y* $)
        define
                x as mth  = 9,
                y as mth  > 10
);

In this case we still have to do a lot of back tracking, but each time we backtrack one step we then check just one row forward for the match to fail (9 is not greater than 10), whereas with the original if we have backtracked 750 steps (for example) we would then have to check 750 rows before we reached the October row for the match to fail.

Bottom line: back-tracking is a massive cost if you have to take a lot of steps backwards to the previous starting row; and you need the match to fail (or succeed) as fast as possible as you start stepping forward again.

Addendum

Since Stew Ashton had highlighted the omission in the previous blog post I passed him a copy of this post before publishing it, asking him to check whether there were any errors or omissions in the way I had described the work Oracle would do back tracking in this example. He said that he couldn’t think of anything to improve the explanation (though I will still claim responsibility for any errors, omissions, or ambiguities) and then suggested another, far more efficient, way of getting the required answer by (again) re-thinking the question before writing the code. His solution looks like this:


select  *
from    (
        select
                t1.*,
                extract(year from trans_dt) yr,
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt nulls first
        measures
                padding as t1_padding,
                classifier() cl,
                match_number() as mn
        pattern(x{1})
        define
                x as mth  = 9 and (
                             next(mth) is null or next(mth) > 10
                     )
)
;

The pattern here simply says: for the partition find the first “X”-row, but an X-row is defined as “month is September and either there are no more rows or the next row is after October”. You’ll notice that I’ve modified the “order by” clause to put nulls first – there are none in the sample data, but if there were, this change to the order would ensure that for a row where “mth = 9” the “next(mth)” could only be null if the current row were the last in the partition.

If you imagine walking through the pattern-matching process now, you start looking at rows and keep going until you reach the first September, and each time you find a September you check the next row to see if it’s past the end of partition, or a November or December; if it is you report the current row and move to the end of the partition, if it isn’t you just walk to the next row and repeat the process – you never back-track. Effectively the workload here is simply to sort then walk non-stop through the whole list – and Oracle even tells us that we are using this optimum strategy in the execution plan:


---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                       | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                |      |      1 |        |      0 |00:00:00.01 |     146 |       |       |          |
|   1 |  VIEW                                           |      |      1 |   1000 |      0 |00:00:00.01 |     146 |       |       |          |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|      |      1 |   1000 |      0 |00:00:00.01 |     146 |  1186K|   567K| 1054K (0)|
|   3 |    VIEW                                         |      |      1 |   1000 |   1000 |00:00:00.01 |     146 |       |       |          |
|   4 |     TABLE ACCESS FULL                           | T1   |      1 |   1000 |   1000 |00:00:00.01 |     146 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------

Operation 2 – the Match Recognize Sort operation – is reported as “deterministic finite auto”, which basically means the duration of the process is predictable because Oracle knows it is a straight end to end walk with no back-tracking. This is the ideal thing to see when you try to design code using match_recognize().

February 26, 2018

Match_recognize

Filed under: 12c,Match_recognize,Oracle — Jonathan Lewis @ 2:59 pm GMT Feb 26,2018

In the spirit of Cary Millsap’s comment: “The fastest way to do anything is to not do it at all”, here’s my take (possibly not an original one) on solving problems:

“The best time to solve a problem is before it has happened.”

I spend quite a lot of my “non-contact” time thinking about boundary cases, feature collisions, contention issues, and any other things that could go wrong when you start to implement real systems with (new) Oracle features. The benefit of doing this, of course, is that when I’m looking at a client’s system I can often solve problems because I recognise symptoms that I’ve previously created “in the lab”. The strange thing about this is that there have been times when I’ve pushed Oracle to a breaking point, documented it, and then dismissed the threat because “no one would do that in real life” only to find that someone has done it in real life.

All this is just a preamble to a demonstration of a threat with a terrific feature that is just beginning to gain greater acceptance as a solution to some interesting problems – and the demonstration is going to exaggerate the problem to a level that (probably) won’t appear in a production system. The driving example appeared as a question on the OTN/ODC database forum:

“I need customers who have done a transaction in September but not in October.”

There are obviously many ways to address this type of requirement (my first thought was to use the MINUS operator), and a few questions you might ask before trying to address it, but the OP had supplied some data to play with, which consisted of just a few rows of a table with three columns and some data restricted to just one year, and one solution offered was a very simple query using the 12c feature match_recognize():


CREATE TABLE TEST_TABLE   
  ( T_ID NUMBER, -- trans-id  
    CUST_ID NUMBER,   
    TRANS_DT DATE  
  ) ;  
                  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (1,100,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (2,100,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (3,200,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (4,300,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (5,400,to_date('12-JAN-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (6,500,to_date('12-OCT-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (7,500,to_date('12-MAR-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (8,600,to_date('12-SEP-17','DD-MON-RR'));  
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (9,600,to_date('12-JUL-17','DD-MON-RR'));  

commit;

select * from test_table
match_recognize
(
  partition by cust_id
  order by trans_dt
  pattern( x+ y* $)
  define
    x as extract(month from trans_dt)  = 9,
    y as extract(month from trans_dt) != 10
);
 
   CUST_ID
----------
       200
       600
      

The obvious benefit of this solution over a solution involving a set-wise MINUS is that it need only scan the data set once (whereas the MINUS strategy will be scanning it twice with a select distinct in each scan) – but it’s a solution that is likely to be unfamiliar to many people and may need a little explanation.
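For comparison, the MINUS version I had in mind would be something along these lines (my sketch – it returns the same two cust_ids for the sample data):

select  cust_id
from    test_table
where   extract(month from trans_dt) = 9
minus
select  cust_id
from    test_table
where   extract(month from trans_dt) = 10
;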

The partition by cust_id order by trans_dt means we sort the data by those two columns, breaking on cust_id. Then for each cust_id we walk through the data looking for a pattern which is defined as: “one or more rows where the month is september followed by zero or more rows where the month is NOT october followed by the end of the set for the customer”. The SQL leaves many details to default so the result set is just the cust_id column and only one row per occurrence of the pattern (which, given the data set, can occur at most once per customer).

For a cust_id that shows a matching pattern the work we will have done is:

  • Walk through rows for Jan to Aug until we reach the first September – which is the start of the pattern
  • Keep on walking through to the last of the Septembers – which is a partial match
  • Then, one of:
    • Walk through zero rows of November and December and reach the end of cust_id
    • Walk through one or more rows of November and/or December then reach the end of cust_id
  • Record the end of pattern by reporting one row
  • Move on to the next cust_id

The excitement starts when we think about a cust_id that doesn’t have a matching pattern – and for that I’m going to generate a new, extreme, data set.


rem
rem     Script:         match_recognize_07.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        99                              cust_id,
        to_date('01-Sep-2017')          trans_dt,
        lpad(rownum,1000,'0')           padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;

update t1
set
        trans_dt = to_date('01-Oct-2017','dd-mon-yyyy')
where
        rownum = 1
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

select  *
from    (
        select 
                t1.*,
                extract(year from trans_dt) yr, 
                extract(month from trans_dt) mth
        from
                t1
        )
match_recognize
(
        partition by cust_id
        order by trans_dt
        measures
                padding as t1_padding
        pattern( x+  y*  $ )
        define
                x as mth = 9,
                y as mth != 10
);

I’ve moved the calculation of month number from the define clause into an in-line view purely to make the match_recognize() clause a little tidier.

I’ve created a table with just one customer with 100,000 transactions on 1st September 2017, then I’ve updated one row from September to October. Thanks to that one row Oracle is not going to be able to find the requested pattern. I’ve added a padding column of 1,000 characters to the table and included it in the measures that I want to select, so Oracle will have to sort roughly 100MB of data (100,000 rows at roughly 1KB per row) before it starts walking the data to find matches – and, though it’s not visible in the script, the workarea settings mean the session won’t be allowed to expand its PGA to accommodate the whole 100MB.
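The script doesn’t show the workarea settings, but the sort of thing I mean is manual workarea sizing – for illustration only, these are not necessarily the exact values I used:

alter session set workarea_size_policy = manual;
alter session set sort_area_size = 10485760;    -- 10MB, far less than the 100MB sort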

Test 1 – comment out the update and see how long it takes to produce a result: 0.67 seconds, and the padding value reported was the last one from the pattern.
Test 2 – put the update back in place and try again:

After running for 46 seconds with no result I interrupted the query; these are some figures from a snapshot of the session stats:

Name                                                 Value
----                                                 -----
CPU used when call started                           3,662
DB time                                              3,711
user I/O wait time                                   1,538
consistent gets                                     14,300
physical reads direct                            1,298,939
physical read IO requests                          736,478
physical read bytes                         10,640,908,288      
physical writes                                     25,228
physical writes direct                              25,228
physical reads direct temporary tablespace       1,298,939
physical writes direct temporary tablespace         25,228
table scan rows gotten                             100,000
table scan blocks gotten                            14,286

  • I’ve scanned a table of 14,286 blocks to find 100,000 rows.
  • I’ve sorted and spilled to disc, using roughly 25,000 blocks of direct path writes and reads to do the sort.
  • Then I’ve spent the rest of the time burning up CPU and reading nearly 1.3 million blocks from the temporary tablespace trying to find a match.

The way that basic pattern matching works on a match failure is to go back to the row after the one where the current match attempt started, and begin all over again. So in this example, after dumping 100MB of Septembers to temp, Oracle started at row 1, read 99,999 rows, then found the October that failed the match; so it went to row 2 [ed: doing some very expensive back-tracking: see comment #2 from Stew Ashton], read 99,998 rows, then found the October that failed the match; so it went to row 3 and so on. Every time it went back to (nearly) the beginning it had to start re-reading those 100,000 rows from temp because the session wasn’t allowed to keep the whole 100MB in memory.

You need to avoid defining a pattern that has to scan large volumes of data to identify a single occurrence of the pattern if the matching process is likely to fail. Even if you can keep the appropriate volume of data in memory for the entire time and avoid a catastrophic volume of reads from the temporary tablespace you can still see a huge amount of CPU being used to process the data – when I reduced the table from 100,000 rows to 10,000 rows it still took me 99 CPU seconds to run the query.

tl;dr

The 12c match_recognize() is a terrific tool, but you must remember two important details about the default behaviour when you think about using it:

  • You will sort a volume of data that is the number of input rows multiplied by the total length of the measures/partition output.
  • If you have a long sequence of rows that ends up failing to match a pattern Oracle goes back to the row after the start of the previous match attempt.

With the usual proviso that “large”, “small” etc. are all relative: keep the data volume small, and try to define patterns that will be short runs of rows.

Do note, however, that I engineered this example to produce a catastrophe. There are many non-default actions you can choose to minimise the workload you’re likely to produce with match_recognize(), and if you just spare a little time to think about worst case events you probably won’t need to face a scenario like this in a real production environment.

See also:

Part 6 (which includes a list of earlier installments) of an introductory series to match_recognize() by Keith Laker.

A pdf file of Keith Laker’s presentation on match_recognize(), including some technical implementation details.

 
