In previous installments of this series I’ve been describing how Oracle estimates the join cardinality for single column joins with equality where the columns have histograms defined. So far I’ve covered two options for the types of histogram involved: frequency to frequency, and frequency to top-frequency. Today it’s time to examine frequency to hybrid.
My first thought about this combination was that it was likely to be very similar to frequency to top-frequency because a hybrid histogram has a list of values with “repeat counts” (which is rather like a simple frequency histogram), and a set of buckets with variable sizes that could allow us to work out an “average selectivity” of the rest of the data.
I was nearly right but the arithmetic didn’t quite work out the way I expected. Fortunately Chinar Aliyev’s document highlighted my error – the optimizer doesn’t use all the repeat counts, it uses only those repeat counts that identify popular values, and a popular value is one where the endpoint_repeat_count is not less than the average number of rows in a bucket. Let’s work through an example – first the data (which repeats an earlier article, but is included here for ease of reference):
rem rem Script: freq_hist_join_06.sql rem Author: Jonathan Lewis rem Dated: Oct 2018 rem set linesize 156 set pagesize 60 set trimspool on execute dbms_random.seed(0) create table t1 ( id number(6), n04 number(6), n05 number(6), n20 number(6), j1 number(6) ) ; create table t2( id number(8,0), n20 number(6,0), n30 number(6,0), n50 number(6,0), j2 number(6,0) ) ; insert into t1 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 4) + 1 n04, mod(rownum, 5) + 1 n05, mod(rownum, 20) + 1 n20, trunc(2.5 * trunc(sqrt(v1.id*v2.id))) j1 from generator v1, generator v2 where v1.id <= 10 -- > comment to avoid WordPress format issue and v2.id <= 10 -- > comment to avoid WordPress format issue ; insert into t2 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 20) + 1 n20, mod(rownum, 30) + 1 n30, mod(rownum, 50) + 1 n50, 28 - round(abs(7*dbms_random.normal)) j2 from generator v1 where rownum <= 800 -- > comment to avoid WordPress format issue ; commit; begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1 for columns j1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', method_opt => 'for all columns size 1 for columns j2 size 13' ); end; /
As before I’ve got a table with 100 rows using the sqrt() function to generate column j1, and a table with 800 rows using the dbms_random.normal function to generate column j2. So the two columns have skewed patterns of data distribution, with a small number of low values and larger numbers of higher values – but the two patterns are different.
I’ve generated a histogram with 254 buckets (which dropped to 10) for the t1.j1 column, and generated a histogram with 13 buckets for the t2.j2 column as I knew (after a little trial and error) that this would give me a hybrid histogram.
Here’s a simple query, with its result set, to report the two histograms – using a full outer join to line up matching values and show the gaps where (endpoint) values in one histogram do not appear in the other:
define m_popular = 62 break on report skip 1 compute sum of product on report compute sum of product_rp on report compute sum of t1_count on report compute sum of t2_count on report compute sum of t2_repeats on report compute sum of t2_pop_count on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number, endpoint_repeat_count, to_number(null) from user_tab_histograms where table_name = 'T1' and column_name = 'J1' order by endpoint_value ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number, endpoint_repeat_count, case when endpoint_repeat_count >= &m_popular then endpoint_repeat_count else null end pop_count from user_tab_histograms where table_name = 'T2' and column_name = 'J2' order by endpoint_value ) select f1.value t1_value, f2.value t2_value, f1.row_or_bucket_count t1_count, f2.row_or_bucket_count t2_count, f1.endpoint_repeat_count t1_repeats, f2.endpoint_repeat_count t2_repeats, f2.pop_count t2_pop_count from f1 full outer join f2 on f2.value = f1.value order by coalesce(f1.value, f2.value) ; T1_VALUE T2_VALUE T1_COUNT T2_COUNT T1_REPEATS T2_REPEATS T2_POP_COUNT ---------- ---------- ---------- ---------- ---------- ---------- ------------ 1 1 1 2 5 0 5 15 0 7 15 0 10 17 0 12 13 0 15 15 13 55 0 11 17 17 11 56 0 34 19 67 36 20 20 7 57 0 57 21 44 44 22 22 3 45 0 45 23 72 72 72 24 70 70 70 25 25 1 87 0 87 87 26 109 109 109 27 96 96 96 28 41 41 ---------- ---------- ---------- ---------- ---------- ------------ 100 800 703 434
You’ll notice that there’s a substitution variable (m_popular) in this script that I use to identify the “popular values” in the hybrid histogram so that I can report them separately. I’ve set this value to 62 for this example because a quick check of user_tables and user_tab_cols tells me I have 800 rows in the table (user_tables.num_rows) and 13 buckets (user_tab_cols.num_buckets) in the histogram: 800/13 = 61.52. A value is popular only if its repeat count is 62 or more.
This is where you may hit a problem – I certainly did when I switched from testing 18c to testing 12c (which I just knew was going to work – but I tested anyway). Although my data has been engineered so that I get the same “random” data in both versions of Oracle, I got different hybrid histograms (hence my complaint in a recent post.) The rest of this covers 18c in detail, but if you’re running 12c there are a couple of defined values that you can change to get the right results in 12c.
At this point I need to “top and tail” the output because the arithmetic only applies where the histograms overlap, so I need to pick the range from 2 to 25. Then I need to inject a “representative” or “average” count/frequency in all the gaps, then cross-multiply. The average frequency for the frequency histogram is “half the frequency of the least frequently occurring value” (which seems to be identical to new_density * num_rows), and the representative frequency for the hybrid histogram is (“number of non-popular rows” / “number of non-popular values”). There are 800 rows in the table with 22 distinct values in the column, and the output above shows us that we have 5 popular values totally 434 rows, so the average frequency is (800 – 434) / (22 – 5) = 21.5294. (Alternatively we could say that the average selectivities (which is what I’ve used in the next query) are 0.5/100 and 21.5294/800.)
[Note for 12c, you’ll get 4 popular values covering 338 rows, so your figurese will be: (800 – 338) / (22 – 4) = 25.6666… and 0.0302833]
So here’s a query that restricts the output to the rows we want from the histograms, discards a couple of columns, and does the arithmetic:
define m_t2_sel = 0.0302833 define m_t2_sel = 0.0269118 define m_t1_sel = 0.005 break on table_name skip 1 on report skip 1 with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number, endpoint_repeat_count, to_number(null) pop_count from user_tab_histograms where table_name = 'T1' and column_name = 'J1' order by endpoint_value ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number, endpoint_repeat_count, case when endpoint_repeat_count >= &m_popular then endpoint_repeat_count else null end pop_count from user_tab_histograms where table_name = 'T2' and column_name = 'J2' order by endpoint_value ) select f1.value f1_value, f2.value f2_value, nvl(f1.row_or_bucket_count,100 * &m_t1_sel) t1_count, nvl(f2.pop_count, 800 * &m_t2_sel) t2_count, case when ( f1.row_or_bucket_count is not null or f2.pop_count is not null ) then nvl(f1.row_or_bucket_count,100 * &m_t1_sel) * nvl(f2.pop_count, 800 * &m_t2_sel) end product_rp from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between 2 and 25 order by coalesce(f1.value, f2.value) ; F1_VALUE F2_VALUE T1_COUNT T2_COUNT PRODUCT_RP ---------- ---------- ---------- ---------- ---------- 2 5 21.52944 107.6472 5 15 21.52944 322.9416 7 15 21.52944 322.9416 10 17 21.52944 366.00048 12 13 21.52944 279.88272 15 15 13 21.52944 279.88272 17 17 11 21.52944 236.82384 19 .5 21.52944 20 20 7 21.52944 150.70608 21 .5 21.52944 22 22 3 21.52944 64.58832 23 .5 72 36 24 .5 70 35 25 25 1 87 87 ---------- ---------- ---------- sum 102 465.82384 2289.41456
There’s an important detail that I haven’t mentioned so far. In the output above you can see that some rows show “product_rp” as blank. While we cross multiply the frequencies from t1.j1 and t2.j2, filling in average frequencies where necessary, we exclude from the final result any rows where average frequencies have been used for both histograms.
[Note for 12c, you’ll get the result 2698.99736 for the query, and 2699 for the execution plan]
Of course we now have to check that the predicted cardinality for a simple join between these two tables really is 2,289. So let’s run a suitable query and see what the optimizer predicts:
set serveroutput off alter session set statistics_level = all; alter session set events '10053 trace name context forever'; select count(*) from t1, t2 where t1.j1 = t2.j2 ; select * from table(dbms_xplan.display_cursor(null,null,'allstats last')); alter session set statistics_level = typical; alter session set events '10053 trace name context off'; SQL_ID cf4r52yj2hyd2, child number 0 ------------------------------------- select count(*) from t1, t2 where t1.j1 = t2.j2 Plan hash value: 906334482 ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 108 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 108 | | | | |* 2 | HASH JOIN | | 1 | 2289 | 1327 |00:00:00.01 | 108 | 2546K| 2546K| 1194K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 18 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 34 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")
As you can see, the E-Rows for the join is 2,289, as required.
I can’t claim that the model I’ve produced is definitely what Oracle does, but it looks fairly promising. No doubt, though, there are some variations on the theme that I haven’t considered – even when sticking to a simple (non-partitioned) join on equality on a single column.
[…] Hybrid […]
Pingback by Join Cardinality – 5 | Oracle Scratchpad — November 1, 2018 @ 1:34 pm GMT Nov 1,2018 |
[…] Frequency / Hybrid with filter predicates […]
Pingback by Histogram catalogue | Oracle Scratchpad — November 8, 2022 @ 1:07 pm GMT Nov 8,2022 |