One of the articles I wrote for redgate’s AllthingsOracle site some time ago included a listing of the data distribution for some client data which I had camouflaged. A recent comment on the article asked how I had generated the data – of course the answer was that I hadn’t generated it, but I had done something to take advantage of its existence without revealing the actual values. This article is just a little note showing what I did; it’s not intended as an elegant and stylish display of perfectly optimised SQL, it’s an example of a quick and dirty one-off hack that wasn’t (in my case) a disaster to run.
I’ve based the demonstration on the view all_objects. We start with a simple query showing the distribution of the values of column object_type:
break on report
compute sum of count(*) on report

select
        object_type, count(*)
from
        all_objects
group by
        object_type
order by
        count(*) desc
;

OBJECT_TYPE           COUNT(*)
------------------- ----------
SYNONYM                  30889
JAVA CLASS               26447
...
JAVA RESOURCE              865
TRIGGER                    509
JAVA DATA                  312
...
JAVA SOURCE                  2
DESTINATION                  2
LOB PARTITION                1
EDITION                      1
MATERIALIZED VIEW            1
RULE                         1
                    ----------
sum                      76085

44 rows selected.
Starting from this data set I want 44 randomly generated strings and an easy way to translate the actual object type into one of those strings. There are various ways to do this but the code I hacked out put the original query into an inline view, surrounded it with a query that added a rownum to the result set to give each row a unique id, then used the well-known and much-loved “connect by level” query against dual to generate a numbered list of randomly generated strings as an inline view that I could use in a join to do the translation.
execute dbms_random.seed(0)

column random_string format a6

select
        generator.id,
        dbms_random.string('U',6)       random_string,
        sum_view.specifier,
        sum_view.ct                     "COUNT(*)"
from
        (
        select  rownum id
        from    dual
        connect by level <= 100
        )       generator,
        (
        select  rownum id, specifier, ct
        from    (
                select  object_type specifier, count(*) ct
                from    all_objects
                group by
                        object_type
                order by
                        count(*) desc
                )
        )       sum_view
where
        sum_view.id = generator.id
order by
        ct desc
;

        ID RANDOM SPECIFIER             COUNT(*)
---------- ------ ------------------- ----------
         1 BVGFJB SYNONYM                  30889
         2 LYYVLH JAVA CLASS               26447
...
         9 DNRYKC JAVA RESOURCE              865
        10 BEWPEQ TRIGGER                    509
        11 UMVYVP JAVA DATA                  312
...
        39 EYYFUJ JAVA SOURCE                  2
        40 SVWKRC DESTINATION                  2
        41 CFKBRX LOB PARTITION                1
        42 ZWVEVH EDITION                      1
        43 DDAHZX MATERIALIZED VIEW            1
        44 HFWZBX RULE                         1
                                      ----------
sum                                        76085

44 rows selected.
I’ve selected the id and original value here to show the correspondence, but didn’t need to show them in the original posting. I’ve also left the original (now redundant) “order by” clause in the main inline view, and you’ll notice that although I needed only 44 distinct strings for the instance I produced the results on, I generated 100 values as a safety margin for testing the code on a couple of other versions of Oracle.
A quick check for efficiency – a brief glance at the execution plan, which might have prompted me to add a couple of /*+ no_merge */ hints had they been necessary – showed that the work done was basically the work of the original query plus a tiny increment for adding the rownum and doing the “translation join”. Of course, if I’d then wanted to translate the full 76,000-row data set and save it as a table I’d have had to join the result set above back to a second copy of all_objects – and it’s in translating full data sets, while trying to deal with problems of referential integrity and correlation, that the time disappears when masking data.
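The shape of that final “translation join” can be sketched in miniature. This is a hedged illustration only – Python with an in-memory SQLite database standing in for Oracle, and the table name my_objects, its columns, and the sample rows all invented for the demonstration: build a small translation table with one random string per distinct value, then join it back to the full data set to produce the masked copy.

```python
# Sketch of the "translation join" step, using Python + sqlite3 as a
# stand-in for Oracle.  Table/column names and sample data are invented.
import random
import sqlite3
import string

random.seed(0)      # reproducible run, in the spirit of dbms_random.seed(0)

def random_string(n=6):
    """Mimic dbms_random.string('U', n): n random upper-case letters."""
    return ''.join(random.choice(string.ascii_uppercase) for _ in range(n))

con = sqlite3.connect(":memory:")
con.execute("create table my_objects (object_name text, object_type text)")
con.executemany(
    "insert into my_objects values (?, ?)",
    [("S1", "SYNONYM"), ("S2", "SYNONYM"), ("T1", "TRIGGER"),
     ("P1", "PACKAGE"), ("P2", "PACKAGE")],
)

# One random string per distinct object_type -- the translation table.
con.execute("create table translation (specifier text primary key, masked text)")
for (specifier,) in con.execute(
        "select distinct object_type from my_objects order by 1").fetchall():
    con.execute("insert into translation values (?, ?)",
                (specifier, random_string()))

# The translation join: the full data set joined to the (small) mapping.
masked = con.execute("""
    select  o.object_name, t.masked
    from    my_objects o
            join translation t on t.specifier = o.object_type
""").fetchall()
for row in masked:
    print(row)
```

Every row of the base table comes back with its type replaced by the same six-character string for each occurrence of a given type, which is exactly the property the camouflaged listing needed.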
It is a minor detail of this code that it produced fixed-length strings (which matched the structure of the original client data). Had I felt the urge I might have used something like: dbms_random.string('U', trunc(dbms_random.value(4,21))) to give me a random distribution of string lengths between 4 and 20. Getting fussier, I might have extracted the distinct values for object_type and then generated a random string that matched the length of the value it was due to replace. Fussier still, I might have generated the right number of random strings matching the length of the longest value, sorted the original and random values into alphabetical order to align them, then trimmed each random value to the length of the corresponding original value.
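That “fussier still” variant is easier to see in code than in prose. Here is a minimal sketch in plain Python (all names invented, and random_string standing in for dbms_random.string('U', n)): generate one candidate per distinct value at the length of the longest original, sort both lists alphabetically to align them, then trim each candidate to the length of the original it replaces.

```python
# Sketch of length-matched masking via sorted alignment and trimming.
import random
import string

random.seed(42)     # arbitrary seed, for repeatability only

def random_string(n):
    # stand-in for dbms_random.string('U', n)
    return ''.join(random.choice(string.ascii_uppercase) for _ in range(n))

originals = ["SYNONYM", "TRIGGER", "PACKAGE", "PACKAGE BODY", "INDEX"]

# One candidate per distinct value, all at the length of the longest.
max_len = max(len(v) for v in originals)
candidates = sorted(random_string(max_len) for _ in originals)

# Align sorted originals with sorted candidates, trimming to length.
translation = {
    orig: cand[:len(orig)]
    for orig, cand in zip(sorted(originals), candidates)
}
for orig, masked in sorted(translation.items()):
    print(f"{orig:15} -> {masked}")
```

Each masked value ends up exactly as long as the value it replaces, and the alphabetical alignment means the masked values sort in the same order as the originals.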
It’s extraordinary how complicated it can be to mask data realistically – even when you’re looking at just one column in one table. And here’s a related thought – if an important type of predicate in the original application is where object_type like 'PACK%', how do you ensure that your masked data is consistent with the data that would be returned by this query, and how do you determine the value to use instead of “PACK” as the critical input when you run the critical queries against the masked data? (Being privileged may give you part of the answer, but bear in mind that the people doing the testing with that data shouldn’t be able to see the unmasked data or any translation tables.)
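A toy demonstration of why that question has no easy answer (plain Python, all names invented): when each distinct value is masked independently, values that share a prefix in the original data generally share no usable prefix once masked, so there is no single masked pattern equivalent to the original like 'PACK%' predicate.

```python
# Illustration: independently masked values lose their common prefixes.
import os
import random
import string

random.seed(1)      # arbitrary seed, for repeatability only

def random_string(n):
    # stand-in for dbms_random.string('U', n)
    return ''.join(random.choice(string.ascii_uppercase) for _ in range(n))

originals = ["PACKAGE", "PACKAGE BODY", "PROCEDURE", "SYNONYM"]
translation = {v: random_string(len(v)) for v in originals}

# The rows the original predicate would pick up ...
pack_rows = [v for v in originals if v.startswith("PACK")]
# ... and what they look like after masking.
masked_pack = [translation[v] for v in pack_rows]
prefix = os.path.commonprefix(masked_pack)

print("masked 'PACK%' values:       ", masked_pack)
print("longest common masked prefix:", repr(prefix))
```

Whatever common prefix (if any) the masked values happen to share is an accident of the random draw, not something a tester could predict – which is why reproducing prefix-based predicates against masked data needs a deliberate design decision rather than a lucky translation table.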