Data Hashing

April 9, 2018

Data Hashing

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 6:10 pm BST Apr 9,2018

Here’s a little-known feature that has been around since at least Oracle 10, though I don’t think I had ever seen it in the wild until today when someone reported on the ODC (OTN) database forum that they had a problem getting repeatable results. It’s always possible, of course, that failure to get repeatable results is the natural consequence of running queries against a multi-user system, but if we assume that this was not the cause in this case we have to ask why a special hashing function that Oracle supplies to allow you to check that a set of data hasn’t changed gives you different results when “the data hasn’t changed”.

I’m talking about the function dbms_sqlhash.gethash() – a packaged function that exists in the SYS schema and isn’t usually exposed to other users. The function takes as its inputs the text of query, a selected hashing function, and a “chunk” size. It will run the query, and use the hashing function to return a single, 16 – 64 byte, hash value representing the entire result set. Here’s an example of usage:


begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

6496D2438FECA960B1E916BF8C4BADCA

I haven’t specified a chunk size – the default is 128MB – and Oracle will hash this much of the result set in a single pass. If the result set is larger than this Oracle will hash each chunk in turn then generate a hash of the hash values. (This means, by the way, that changing the chunk size can change the hash value for large data sets).

There are 6 possible digest types in 12.1.0.2 (listed in the $ORACLE_HOME/rdbms/admin/dbmsobtk.sql script that creates the dbms_crypto package – so you will need the execute privilege on both dbms_sqlhash and dbms_crypto to use the function if you want to code with symbolic constants):

rem         HASH_MD4           CONSTANT PLS_INTEGER            :=     1;
rem         HASH_MD5           CONSTANT PLS_INTEGER            :=     2;
rem         HASH_SH1           CONSTANT PLS_INTEGER            :=     3;
rem         HASH_SH256         CONSTANT PLS_INTEGER            :=     4;
rem         HASH_SH384         CONSTANT PLS_INTEGER            :=     5;
rem         HASH_SH512         CONSTANT PLS_INTEGER            :=     6;

Let’s put the whole thing into a demonstration that will allow us to see an important point – you have to be careful with your query:


rem
rem     Script:         gethash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2016
rem

execute dbms_random.seed(0)

create table t1
nologging
as
select
        1e4 - rownum                    id,
        trunc(dbms_random.value(0,100)) n1,
        trunc(sysdate)                  d1,
        lpad('x',100,'x')               padding
from
        dual
connect by
        level <= 1e4 -- > comment to avoid WordPress format issue
;

alter table t1 add constraint t1_pk primary key (id);

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

set feedback off

alter system flush shared_pool;
alter session set optimizer_mode = first_rows_1;

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

alter system flush shared_pool;
alter session set optimizer_mode = all_rows;

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

alter system flush shared_pool;
alter session set nls_date_format='dd-mon-yyyy hh24:mi:ss';

begin
        dbms_output.put_line(
                dbms_sqlhash.gethash(
                        sqltext     => 'select n1, d1 from t1 where id > 0',
                        digest_type => dbms_crypto.hash_md5
                        -- chunk_size  => 128*1048576   -- default 128MB
                )
        );
end;
/

alter session set nls_date_format='DD-MON-RR';

I’ve created a data set, added a primary key, and gathered stats, then I’ve called the same hashing function on the same sql statement three times in a row. However, I’ve changed the session environment for each call – in the first case I’ve set the optimizer to “first rows(1)” optimization, then I’ve set the optimizer back to all_rows, then I’ve changed the nls_date_format from its default of “DD-MON-RR” (and that’s significant because I’ve got a date column in my query). Here are the output from running the script:


Table created.


Table altered.


PL/SQL procedure successfully completed.

6496D2438FECA960B1E916BF8C4BADCA
D41D4A2945D0B89A6C5DEB5060189A54
ECC3D2B66CB61821397CD9BD983FD5F4

The query has to return the same data content in all three cases – but the hash value is different in the three cases. The change in the optimizer mode has affected the order in which the data was returned (with first_rows(1) Oracle did a full scan of the primary key index, with all_rows it did a tablescan and sort); the change in the nls_XXX parameter meant the internal representation of the data changed.

You have to be very careful with dbms_sqlhash every time you use it if you want the same data set to produce the same result. First, to be safe, you need to ensure that you always use the same NLS parameters when using the function; then you need to have an “order by” clause in the query, and the columns used in the order by clause need to a possible candidate key (i.e. unique, not null) for the data otherwise a change in the optimizer parameters, or the object stats, could result in a change in execution plan with an ensuing change in the actual order of the data and a different hash value.

Oracle Scratchpad

April 9, 2018

Data Hashing

Leave a Comment »

Comments and related questions are welcome. Cancel reply

Click on the motto below

Everything Changes Eventually

Search this blog

Categories

Special Links

Recent Posts

Recent Comments

Popular articles

Popular References

Archives

Blogroll

Posts by RSS

Oracle Scratchpad

April 9, 2018

Data Hashing

Share this:

Related

Leave a Comment »

Comments and related questions are welcome. Cancel reply

Click on the motto below

Everything Changes Eventually

Search this blog

Categories

Special Links

Recent Posts

Recent Comments

Popular articles

Popular References

Archives

Blogroll

Posts by RSS