In the spirit of Cary Millsap’s comment: “The fastest way to do anything is to not do it at all”, here’s my take (possibly not an original one) on solving problems:
“The best time to solve a problem is before it has happened.”
I spend quite a lot of my “non-contact” time thinking about boundary cases, feature collisions, contention issues, and any other things that could go wrong when you start to implement real systems with (new) Oracle features. The benefit of doing this, of course, is that when I’m looking at a client’s system I can often solve problems because I recognise symptoms that I’ve previously created “in the lab”. The strange thing about this is that there have been times when I’ve pushed Oracle to a breaking point, documented it, and then dismissed the threat because “no one would do that in real life” only to find that someone has done it in real life.
All this is just a preamble to a demonstration of a threat that comes with a terrific feature which is just beginning to gain greater acceptance as a solution to some interesting problems – and the demonstration is going to exaggerate the problem to a level that (probably) won’t appear in a real production system. The driving example appeared as a question on the OTN/ODC database forum:
“I need customers who have done a transaction in September but not in October.”
There are obviously many ways to address this type of requirement (my first thought was to use the MINUS operator), and a few questions you might ask before trying to address it, but the OP had supplied some data to play with, which consisted of just a few rows of a table with three columns and some data restricted to just one year; one solution offered was a very simple query using the 12c feature match_recognize():
create table test_table (
        t_id            number,         -- trans-id
        cust_id         number,
        trans_dt        date
);

Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (1,100,to_date('12-SEP-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (2,100,to_date('12-OCT-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (3,200,to_date('12-SEP-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (4,300,to_date('12-OCT-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (5,400,to_date('12-JAN-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (6,500,to_date('12-OCT-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (7,500,to_date('12-MAR-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (8,600,to_date('12-SEP-17','DD-MON-RR'));
Insert into TEST_TABLE (T_ID,CUST_ID,TRANS_DT) values (9,600,to_date('12-JUL-17','DD-MON-RR'));
commit;

select  *
from    test_table
match_recognize (
        partition by cust_id
        order by trans_dt
        pattern( x+ y* $ )
        define
                x as extract(month from trans_dt) = 9,
                y as extract(month from trans_dt) != 10
)
;

   CUST_ID
----------
       200
       600
The obvious benefit of this solution over a solution involving a set-wise MINUS is that it need only scan the data set once (whereas the MINUS strategy will be scanning it twice with a select distinct in each scan) – but it’s a solution that is likely to be unfamiliar to many people and may need a little explanation.
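For comparison, here’s a sketch of the MINUS approach (my reconstruction, not a query from the forum thread) – remember that MINUS applies a DISTINCT to each branch:

select  cust_id
from    test_table
where   extract(month from trans_dt) = 9
minus
select  cust_id
from    test_table
where   extract(month from trans_dt) = 10
;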
Turning back to the match_recognize() version: the partition by cust_id order by trans_dt means we sort the data by those two columns, breaking on cust_id. Then for each cust_id we walk through the data looking for a pattern which is defined as: “one or more rows where the month is September, followed by zero or more rows where the month is NOT October, followed by the end of the set for the customer”. The SQL leaves many details to default, so the result set is just the cust_id column and only one row per occurrence of the pattern (which, given the data set, can occur at most once per customer).
For a cust_id that shows a matching pattern the work we will have done is:
- Walk through the rows for Jan to Aug until we reach the first September – which is the start of the pattern
- Keep on walking through to the last of the Septembers – which is a partial match
- Then, one of the following:
- Walk through zero rows of November and December and reach the end of the data for the cust_id
- Walk through one or more rows of November and/or December and then reach the end of the data for the cust_id
- Record the end of the pattern by reporting one row
- Move on to next cust_id
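To tie that walkthrough back to the SQL, here is the query again with my own comments against each clause:

select  *
from    test_table
match_recognize (
        partition by cust_id                            -- break the data up by customer
        order by trans_dt                               -- sort each customer's rows by date
        pattern( x+ y* $ )                              -- one or more X rows, then zero or more Y rows, then end of partition
        define
                x as extract(month from trans_dt) = 9,  -- an X row is a September row
                y as extract(month from trans_dt) != 10 -- a Y row is any row that is not an October row
)
;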
The excitement starts when we think about a cust_id that doesn’t have a matching pattern – and for that I’m going to generate a new, extreme, data set.
rem
rem     Script:         match_recognize_07.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1 nologging
as
with generator as (
        select  rownum id
        from    dual
        connect by
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                                  id,
        99                                      cust_id,
        to_date('01-Sep-2017','dd-mon-yyyy')    trans_dt,
        lpad(rownum,1000,'0')                   padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5   -- > comment to avoid WordPress format issue
;

update  t1
set     trans_dt = to_date('01-Oct-2017','dd-mon-yyyy')
where   rownum = 1
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

select  *
from    (
        select
                t1.*,
                extract(year from trans_dt)     yr,
                extract(month from trans_dt)    mth
        from    t1
        )
match_recognize (
        partition by cust_id
        order by trans_dt
        measures padding as t1_padding
        pattern( x+ y* $ )
        define
                x as mth = 9,
                y as mth != 10
)
;
I’ve moved the calculation of month number from the define clause into an in-line view purely to make the match_recognize() clause a little tidier.
I’ve created a table with just one customer with 100,000 transactions on 1st September 2017, then I’ve updated one row from September to October. Thanks to that one row Oracle is not going to be able to find the requested pattern. I’ve added a padding column of 1,000 characters to the table and included it in the measures that I want to select, so Oracle will have to sort roughly 100MB of data (100,000 rows at roughly 1KB per row) before it starts walking the data to find matches – and, though it’s not visible in the script, the workarea settings mean the session won’t be allowed to expand its PGA to accommodate the whole 100MB.
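For reference, the restriction could be imposed with manual workarea sizing along the following lines (a sketch – the exact size I used isn’t recorded here):

alter session set workarea_size_policy = manual;
alter session set sort_area_size = 10485760;            -- e.g. 10MB, far less than the 100MB to be sorted
alter session set sort_area_retained_size = 10485760;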
Test 1 – comment out the update and see how long it takes to produce a result: 0.67 seconds, and the padding value reported was the last one from the pattern.
Test 2 – put the update back in place and try again:
After the query had run for 46 seconds without producing a result I interrupted it; these are some figures from a snapshot of the session stats:
Name                                                         Value
----                                                         -----
CPU used when call started                                   3,662
DB time                                                      3,711
user I/O wait time                                           1,538
consistent gets                                             14,300
physical reads direct                                    1,298,939
physical read IO requests                                  736,478
physical read bytes                                 10,640,908,288
physical writes                                             25,228
physical writes direct                                      25,228
physical reads direct temporary tablespace               1,298,939
physical writes direct temporary tablespace                 25,228
table scan rows gotten                                     100,000
table scan blocks gotten                                    14,286
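For reference, a snapshot like this can be produced by capturing the session statistics before and after the test; a query along these lines reports the raw values (a sketch, not the exact snapshot script I used):

select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
and     ms.value != 0
order by
        sn.name
;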
- I’ve scanned a table of 14,286 blocks to find 100,000 rows.
- I’ve sorted and spilled to disc, using roughly 25,000 blocks of direct path writes and reads to do the sort.
- Then I’ve spent the rest of the time burning up CPU and reading 1.27 million blocks from the temporary tablespace trying to find a match.
The way that basic pattern matching works on a match failure is to go back to the row after the one where the current match attempt started, and begin all over again. So in this example, after dumping 100MB of Septembers to temp, Oracle started at row 1, read 99,999 rows, then found the October that failed the match; so it went to row 2 [ed: doing some very expensive back-tracking: see comment #2 from Stew Ashton], read 99,998 rows, then found the October that failed the match; so it went to row 3, and so on. Every time it went back to (nearly) the beginning it had to start re-reading those 100,000 rows from temp because the session wasn’t allowed to keep the whole 100MB in memory.
You need to avoid defining a pattern that has to scan large volumes of data to identify a single occurrence of the pattern if the matching process is likely to fail. Even if you can keep the appropriate volume of data in memory for the entire time and avoid a catastrophic volume of reads from the temporary tablespace you can still see a huge amount of CPU being used to process the data – when I reduced the table from 100,000 rows to 10,000 rows it still took me 99 CPU seconds to run the query.
tl;dr
The 12c match_recognize() is a terrific tool, but you must remember two important details about the default behaviour when you think about using it:
- You will sort a volume of data that is the number of input rows multiplied by the total length of the measures/partition output.
- If you have a long sequence of rows that ends up failing to match a pattern Oracle goes back to the row after the start of the previous match attempt.
With the usual proviso that “large”, “small” etc. are all relative: keep the data volume small, and try to define patterns that will be short runs of rows.
Do note, however, that I engineered this example to produce a catastrophe. There are many non-default actions you can choose to minimise the workload you’re likely to produce with match_recognize(), and if you just spare a little time to think about worst case events you probably won’t need to face a scenario like this in a real production environment.
See also:
Part 6 (which includes a list of earlier installments) of an introductory series to match_recognize() by Keith Laker.
A pdf file of Keith Laker’s presentation on match_recognize(), including some technical implementation details.
And a link to volume 1 of a (free) iBook that Keith Laker has written on the topic.
It seems that our regex knowledge from languages like (especially) Perl might come in handy again, when it comes to avoiding backtracking swamps. Although Oracle might not (yet) supply us with all the tools that Perl did to avoid them.
Comment by Jason B. — February 26, 2018 @ 8:24 pm GMT Feb 26,2018 |
I like the example for match_recognize() but find it is not available in 11g.
Below is the solution I came up with in 11g; it produces the same result:
with x as (
        select  cust_id,
                listagg(extract(month from trans_dt), ' ')
                        within group (order by trans_dt)        e_seq
        from    t_tb11
        group by cust_id
)
select  *
from    x
where   regexp_like(e_seq, '9')
and     not regexp_like(e_seq, '10')
;
CUST_ID E_SEQ
200 9
600 7 9
Comment by Ron2018Tweet (@Ron2018Tweet) — February 28, 2018 @ 7:30 pm GMT Feb 28,2018 |
Ron,
Thanks for the comment – I had thought I’d mentioned that match_recognize() was a “new” 12c feature but checking back I hadn’t; so I’ve now added the reminder in a couple of places.
An interesting detail of your solution – the filter in the predicate section of the execution plan is a little odd: I wonder if anyone’s written anything about the strange hextoraw() third parameter that appears inside the regexp_like().
Comment by Jonathan Lewis — March 1, 2018 @ 11:17 am GMT Mar 1,2018 |
Hi.
I wonder why anyone thought to use match_recognize in such a simple case – its syntax is ‘similar’ to that of the model clause, which eats memory like kids eat candy (or even faster).
REGEXP is also expensive, and listagg has a 4KB limit (it raised an error when I tested it with a ‘big’ data set).
To my mind there are simpler solutions (both sketched below):
1. simple aggregation (the extract() could be replaced by to_char() if someone prefers it)
2. analytic functions, if someone needs the entire row (with extra columns) in the result
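As illustrative sketches of those two suggestions (a reconstruction against the original test_table, not exact production queries):

1. Simple aggregation – count September and October transactions per customer:

select  cust_id
from    test_table
group by
        cust_id
having  count(case when extract(month from trans_dt) = 9  then 1 end) > 0
and     count(case when extract(month from trans_dt) = 10 then 1 end) = 0
;

2. Analytic functions – the same counts as window functions, keeping every column of the qualifying rows:

select  *
from    (
        select  t.*,
                count(case when extract(month from trans_dt) = 9  then 1 end)
                        over (partition by cust_id)     sep_ct,
                count(case when extract(month from trans_dt) = 10 then 1 end)
                        over (partition by cust_id)     oct_ct
        from    test_table t
        )
where   sep_ct > 0
and     oct_ct = 0
;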
Comment by Paweł — March 1, 2018 @ 5:40 pm GMT Mar 1,2018 |
Pawel,
A good reason for someone to ask about it in such a simple case is that it might be a very simplified version of actual requirements or it might be an exercise in understanding a mechanism by using a conceptually simple fake requirement.
A reason for suggesting it in response to a simple question is that you might suspect the question was a simplified version of a requirement where match_recognize() would be a good solution.
A reason for using it in a blog is that it makes it easy to demonstrate that a technique that can be incredibly useful in some circumstances can be a total abomination in other circumstances.
Just because someone says: “how do you do X to data set Y” is no reason to assume that X is exactly what they want to do or that Y is exactly the data set they want to do it to.
Comment by Jonathan Lewis — March 1, 2018 @ 7:44 pm GMT Mar 1,2018
HI Jonathan,
Thanks for this post with two important conclusions. A few comments:
1) “You will sort a volume of data that is the number of input rows multiplied [by] the total length of the measures/partition output.”
Exactly. In this case, you added the MEASURES clause with the PADDING column, otherwise the sort area would be small.
We have to watch out for the ALL ROWS PER MATCH option, which implicitly returns all the input columns. However, if the overall SELECT does not return the PADDING column, Oracle is smart enough to remove it from the projected columns at the table access step!
Note that the same goes for analytic functions: the sort area sizes are identical.
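To illustrate the ALL ROWS PER MATCH point, something like the following (my sketch, not a script from the post) implicitly projects every input column through the match, but since the outer select names only cust_id and trans_dt the PADDING column can be pruned at the table access step:

select  cust_id, trans_dt
from    (
        select  t1.*, extract(month from trans_dt) mth
        from    t1
        )
match_recognize (
        partition by cust_id
        order by trans_dt
        all rows per match              -- implicitly returns all the input columns
        pattern( x+ y* $ )
        define
                x as mth = 9,
                y as mth != 10
)
;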
2) “If you have a long sequence of rows that ends up failing to match a pattern Oracle goes back to the row after the start of the previous match attempt.”
That only starts to explain the CPU / temp usage you measured. As mentioned above, the culprit is backtracking:
– get X rows 1 through 99,999
– the next row is not a Y row and we are not at the end of the partition, so:
– “give back” the X row 99,999
– the Y / end of partition condition is not met, so:
– “give back” the X row 99,998
etc., etc.
This “giving back” is what they call backtracking. The first match attempt, starting at the first row, is already doing loads of work.
Put a ^ at the beginning of your pattern. That will guarantee that the first match attempt is the only one. You will still see lots of work being done.
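As a sketch, that’s a one-line change to the demonstration query – with the caveat that the anchored pattern no longer matches a customer whose first transaction falls before September, although for this demonstration data the results are the same:

        pattern( ^ x+ y* $ )            -- the leading ^ anchors each match attempt to the first row of the partition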
Best regards, Stew
Comment by stewashton — March 2, 2018 @ 1:54 pm GMT Mar 2,2018 |
Stew,
Thanks for the comment – especially the second one: I deliberately avoided saying anything about “greedy” vs. “reluctant”, but I’d completely overlooked the impact of the requirement to back-track all the way back to the “next” row.
I still have to put in lots of practice before I’ll be able to interpret or define a pattern as casually as I do an execution plan. I’m still struggling to grasp the pattern you’ve just posted on Twitter: http://twitter.com/StewAshton/status/969540355250180096
Comment by Jonathan Lewis — March 2, 2018 @ 2:55 pm GMT Mar 2,2018 |
Jonathan,
The pattern I posted was wrong, as Iudith Mentzel pointed out. I failed to apply correctly part of the presentation you attended at UKOUG Tech 17!
The link to twitter leads to a corrected version on livesql.
Sorry for the unnecessary struggle.
Best regards, Stew
Comment by stewashton — March 9, 2018 @ 10:08 am GMT Mar 9,2018 |
[…] In my previous post I presented a warning about the potential cost of sorting and the cost of failing to find a match after each pass of a long search. In a comment on that post Stew Ashton reminded me that the cost of repeatedly trying to find a match starting from “the next row down” could be less of a threat than the cost of “back-tracking” before moving to the next row down. […]
Pingback by Match_recognise – 2 | Oracle Scratchpad — March 6, 2018 @ 7:59 am GMT Mar 6,2018 |
Nice article! Especially with the comments and part 2!
A little typo in the second example’s 59th row (a comma at the end, maybe left over from erasing the classifier() and match_number() references from part 2).
Comment by ricsi — March 6, 2018 @ 2:14 pm GMT Mar 6,2018 |
ricsi,
Thanks for the note.
Typo now corrected – and you’re right, it was from deleting those two references after pasting.
Comment by Jonathan Lewis — March 6, 2018 @ 2:32 pm GMT Mar 6,2018 |
[…] at that option until someone paid me to. More seriously I did wonder about the option for using match_recognize to bypass the lag() problem, and may get around to thinking about that more carefully at some […]
Pingback by Lag/Lead slow | Oracle Scratchpad — May 6, 2022 @ 2:57 pm BST May 6,2022 |