So far in this series I’ve written about the way that the optimizer estimates cardinality for an equijoin where one end of the join has a frequency histogram and the other end has a histogram of type:

It’s now time to look at a join where the other end has a * height-balanced* histogram. Arguably it’s not sensible to spend time writing about this since you shouldn’t be creating them in 12c (depending, instead, on the hybrid histogram that goes with the

*), and the arithmetic is different in 11g. However, there still seem to be plenty of people running 12c but not using the*

**auto_sample_size***and that means they could be generating some height-balanced histograms – so let’s generate some data and see what happens.*

**auto_sample_size**rem rem Script: freq_hist_join_04a.sql rem Author: Jonathan Lewis rem Dated: Oct 2018 rem Purpose: rem rem Last tested rem 18.3.0.0 rem 12.2.0.1 rem 12.1.0.2 rem 11.2.0.4 Different results rem drop table t2 purge; drop table t1 purge; set linesize 156 set trimspool on set pagesize 60 set feedback off execute dbms_random.seed(0) create table t1( id number(6), n04 number(6), n05 number(6), n20 number(6), j1 number(6) ) ; create table t2( id number(8,0), n20 number(6,0), n30 number(6,0), n50 number(6,0), j2 number(6,0) ) ; insert into t1 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 4) + 1 n04, mod(rownum, 5) + 1 n05, mod(rownum, 20) + 1 n20, trunc(2.5 * trunc(sqrt(v1.id*v2.id))) j1 from generator v1, generator v2 where v1.id <= 10 -- > comment to avoid WordPress format issue and v2.id <= 10 -- > comment to avoid WordPress format issue ; insert into t2 with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum, 20) + 1 n20, mod(rownum, 30) + 1 n30, mod(rownum, 50) + 1 n50, 28 - round(abs(7*dbms_random.normal)) j2 from generator v1 where rownum <= 800 -- > comment to avoid WordPress format issue ; commit; prompt ========================================================== prompt Using estimate_percent => 100 to get height-balanced in t2 prompt ========================================================== begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1 for columns j1 size 254' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', estimate_percent => 100, method_opt => 'for all columns size 1 for columns j2 size 20' ); end; /

As in earlier examples I’ve created some empty tables, then inserted randomly generated data (after calling the * dbms_random.seed(0)* function to make the data reproducible). Then I’ve gathered stats, knowing that there will be 22 distinct values in

*so forcing a height-balanced histogram of 20 buckets to appear.*

**t2**When we try to calculate the join cardinality we’re going to need various details from the histogram information, such as bucket sizes, number of distinct values, and so on, so in the next few queries to display the histogram information I’ve captured a few values into SQL*Plus variables. Here’s the basic information about the histograms on the join columns * t1.j1* and

*:*

**t2.j2**column num_distinct new_value m_t2_distinct column num_rows new_value m_t2_rows column num_buckets new_value m_t2_buckets column bucket_size new_value m_t2_bucket_size select table_name, column_name, histogram, num_distinct, num_buckets, density from user_tab_cols where table_name in ('T1','T2') and column_name in ('J1','J2') order by table_name ; select table_name, num_rows, decode(table_name, 'T2', num_rows/&m_t2_buckets, null) bucket_size from user_tables where table_name in ('T1','T2') order by table_name ; column table_name format a3 heading "Tab" break on table_name skip 1 on report skip 1 with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) row_or_bucket_count, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ) select f1.* from f1 union all select f2.* from f2 order by 1,2 ; Tab COLUMN_NAME HISTOGRAM NUM_DISTINCT NUM_BUCKETS DENSITY -------------------- -------------------- --------------- ------------ ----------- ---------- T1 J1 FREQUENCY 10 10 .005 T2 J2 HEIGHT BALANCED 22 20 .052652266 Tab NUM_ROWS BUCKET_SIZE -------------------- ---------- ----------- T1 100 T2 800 40 Tab VALUE ROW_OR_BUCKET_COUNT ENDPOINT_NUMBER --- ---------- ------------------- --------------- T1 2 5 5 5 15 20 7 15 35 10 17 52 12 13 65 15 13 78 17 11 89 20 7 96 22 3 99 25 1 100 T2 1 0 0 14 1 1 17 1 2 18 1 3 19 1 4 20 1 5 21 2 7 22 1 8 23 1 9 24 2 11 25 2 13 26 3 16 27 2 18 28 2 20

As you can see, there is a * frequency* histogram on

*reporting a cumulative total of 100 rows; and the histogram on*

**t1***is a*

**t2***histogram of 20 buckets, showing 21, 24, 25, 26, 27 and 28 as popular values with 2,2,2,2,3 and 2 endpoints (i.e. buckets) respectively. You’ll also note that the*

**height-balanced***histogram has 21 rows with row/bucket 0 showing us the minimum value in the column and letting us know that bucket 1 is not exclusively full of the value 14. (If 14 had been the minimum value for the column as well as an end point Oracle would not have created a bucket 0 – that may be a little detail that isn’t well-known – and will be the subject of a little follow-up blog note.)*

**t2**Let’s modify the code to join the two sets of hisogram data on data value – using a full outer join so we don’t lose any data but restricting ourselves to values where the histograms overlap. We’re going to follow the idea we’ve developed in earlier postings and multiply frequencies together to derive a join frequency, so we’ll start with a simple full outer join and assume that when we find a real match value we should behave as if the height-balanced buckets (* t2*) where the bucket count is 2 or greater represent completely full buckets and are popular values..

I’ve also included in this query (because it had a convenient full outer join) a column selection that counts how many rows there are in * t1* with values that fall inside the range of the

*histogram but*

**t2****don’t**match a popular value in

*.*

**t2**column unmatch_ct new_value m_unmatch_ct column product format 999,999.99 break on report skip 1 compute sum of product on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ), join1 as ( select f1.value t1_value, f2.value t2_value, f1.frequency t1_frequency, f2.frequency t2_frequency, sum( case when f2.frequency > 1 and f1.frequency is not null then 0 else f1.frequency end ) over() unmatch_ct, f2.frequency * &m_t2_bucket_size * case when f2.frequency > 1 and f1.frequency is not null then f1.frequency end product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between 2 and 25 -- coalesce(f1.value, f2.value) between &m_low and &m_high order by coalesce(f1.value, f2.value) ) select * from join1 ; T1_VALUE T2_VALUE T1_FREQUENCY T2_FREQUENCY UNMATCH_CT PRODUCT ---------- ---------- ------------ ------------ ---------- ----------- 2 5 99 5 15 99 7 15 99 10 17 99 12 13 99 14 1 99 15 13 99 17 17 11 1 99 18 1 99 19 1 99 20 20 7 1 99 21 2 99 22 22 3 1 99 23 1 99 24 2 99 25 25 1 2 99 80.00 ----------- sum 80.00

We captured the bucket size (* &m_bucket_size*) for the

*histogram as 40 in the earlier SQL, and we can see now that in the overlap range (which I’ve hard coded as 2 – 25) we have three buckets that identify popular values, but only one of them corresponds to a value in the frequency histogram on*

**t2***, so the*

**t1***column shows a value of 1 * 2 * 40 = 80. Unfortunately this is a long way off the prediction that the optimizer is going to make for the simple join. (Eventually we’ll see it’s 1,893 so we have a lot more rows to estimate for).*

**Product**Our code so far only acounts for items that are popular in both tables. Previous experience tells us that when a popular value exists only at one end of the join predicate we need to derive a contribution to the total prediction through an *“average selectivity”* calculated for the other end of the join predicate. For frequency histograms we’ve seen that *“half the number of the least frequently occuring value”* seems to be the appropriate frequency estimate, and if we add that in we’ll get two more contributions to the total from the values 21 and 24 which appear in the height-balanced (* t2*) histogram as popular but don’t appear in the frequency (

*) histogram. Since the lowest frequency in*

**t1***is 1 this would give us two contributions of 0.5 * 2 (buckets) * 40 (bucket size) viz: two contributions of 40 bringing our total to 160 – still a serious shortfall from Oracle’s prediction. So we need to work out how Oracle generates an*

**t1***“average frequency”*for the non-popular values of

*and then apply it to the 99 rows in*

**t2***that haven’t yet been accounted for in the output above.*

**t1**To calculate the *“average selectivity”* of a non-popular row in * t2* I need a few numbers (some of which I’ve already acquired above). The total number of rows in the table (NR), the number of distinct values (NDV), and the number of popular values (NPV), from which we can derive the the number of distinct non-popular values and the number of rows for the non-popular values. The model that Oracle uses to derive these numbers is simply to assume that a value is popular if its frequency in the histogram is greater than one and the number of rows for that value is

*“frequency * bucket size”*.

The first query we ran against the * t2* histogram showed 6 popular values, accounting for 13 buckets of 40 rows each. We reported 22 distinct values for the column and 800 rows for the table so the optimizer assumes the non-popular values account for (22 – 6) = 16 distinct values and (800 – 13 * 40) = 280 rows. So the selectivity of non-popular values is (280/800) * (1/16) = 0.021875. This needs to be multiplied by the 99 rows in

*which don’t match a popular value in*

**t1***– so we now need to write some SQL to derive that number.*

**t2**We could enhance our earlier full outer join and slot 0.5, 99, and 0.021875 into it as “magic” constants. Rather than do that though I’m going to write a couple of messy queries to derive the values (and the low/high range we’re interested in) so that I will be able to tweak the data later on and see if the formula still produces the right answer.

column range_low new_value m_low column range_high new_value m_high column avg_t1_freq new_value m_avg_t1_freq column new_density new_value m_avg_t2_dens with f1 as ( select endpoint_value ep_val, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select endpoint_value ep_val, endpoint_number ep_num, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ) select max(min_v) range_low, min(max_v) range_high, min(min_f)/2 avg_t1_freq, max(new_density) new_density from ( select min(ep_val) min_v, max(ep_val) max_v, min(frequency) min_f, to_number(null) new_density from f1 union all select min(ep_val) min_v, max(ep_val) max_v, null min_f, (max(ep_num) - sum(case when frequency > 1 then frequency end)) / ( max(ep_num) * (&m_t2_distinct - count(case when frequency > 1 then 1 end)) ) new_density from f2 ) ; RANGE_LOW RANGE_HIGH AVG_T1_FREQ NEW_DENSITY ---------- ---------- ----------- ----------- 2 25 .5 .021875

This query finds the overlap by querying the two histograms and reporting the lower high value and higher low value. It also reports the minimum frequency from the frequency histogram and divides by 2, and calculates the number of non-popular rows divided by the total number of rows and the number of distinct non-popular values. (Note that I’ve picked up the number of distinct values in * t2.j2* as a substituion variable generated by one of my earlier queries.) In my full script this messy piece of code runs before the query that showed I showed earlier on that told us how well (or badly) the two histograms matched.

Finally we can use the various values we’ve picked up in a slightly more complex version of the full outer join – with a special row added through a * union all* to give us our the estimate:

break on report skip 1 compute sum of product on report with f1 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T1' and column_name = 'J1' ), f2 as ( select table_name, endpoint_value value, endpoint_number - lag(endpoint_number,1,0) over(order by endpoint_number) frequency, endpoint_number from user_tab_histograms where table_name = 'T2' and column_name = 'J2' ), join1 as ( select f1.value t1_value, f2.value t2_value, f1.frequency t1_frequency, f2.frequency t2_frequency, f2.frequency * case when f2.frequency > 1 and f1.frequency is not null then f1.frequency when f2.frequency > 1 and f1.frequency is null then &m_avg_t1_freq end * &m_t2_bucket_size product from f1 full outer join f2 on f2.value = f1.value where coalesce(f1.value, f2.value) between &m_low and &m_high order by coalesce(f1.value, f2.value) ) select * from join1 union all select null, &m_avg_t2_dens, &m_unmatch_ct, &m_t2_rows * &m_avg_t2_dens, &m_t2_rows * &m_avg_t2_dens * &m_unmatch_ct from dual ; T1_VALUE T2_VALUE T1_FREQUENCY T2_FREQUENCY PRODUCT ---------- ---------- ------------ ------------ ----------- 2 5 5 15 7 15 10 17 12 13 14 1 15 13 17 17 11 1 18 1 19 1 20 20 7 1 21 2 40.00 22 22 3 1 23 1 24 2 40.00 25 25 1 2 80.00 .021875 99 17.5 1,732.50 ----------- sum 1,892.50

It remains only to check what the optimizer thinks the cardinality will be on a simple join, and then modify the data slightly to see if the string of queries continues to produce the right result. Here’s a starting test:

set serveroutput off alter session set statistics_level = all; alter session set events '10053 trace name context forever'; alter session set tracefile_identifier='BASELINE'; select count(*) from t1, t2 where t1.j1 = t2.j2 ; select * from table(dbms_xplan.display_cursor(null,null,'allstats last')); alter session set statistics_level = typical; alter session set events '10053 trace name context off'; COUNT(*) ---------- 1327 PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------------------------------------------------------ SQL_ID f8wj7karu0hhs, child number 0 ------------------------------------- select count(*) from t1, t2 where t1.j1 = t2.j2 Plan hash value: 906334482 ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 41 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 41 | | | | |* 2 | HASH JOIN | | 1 | 1893 | 1327 |00:00:00.01 | 41 | 2545K| 2545K| 1367K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 7 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 7 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")

The * E-rows* for the hash join operation reports 1893 – and a quick check of the 10053 trace file shows that this is 1892.500000 rounded – a perfect match for the result from my query. I’ve modified the data in various ways (notably updating the

*table to change the value 25 (i.e. the current maximum value of*

**t1***) to other, lower, values) and the algorithm in the script seems to be sound – for 12c and 18c. I won’t be surprised, however, if someone comes up with a data pattern where the wrong estimate appears.*

**j1**### Don’t look back

Upgrades are a pain. With the same data set and same statistics on 11.2.0.4, running the same join query between * t1* and

*, here’s the execution plan I got:*

**t2**SQL_ID f8wj7karu0hhs, child number 0 ------------------------------------- select count(*) from t1, t2 where t1.j1 = t2.j2 Plan hash value: 906334482 ----------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem | ----------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 12 | | | | | 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 12 | | | | |* 2 | HASH JOIN | | 1 | 1855 | 1327 |00:00:00.01 | 12 | 2440K| 2440K| 1357K (0)| | 3 | TABLE ACCESS FULL| T1 | 1 | 100 | 100 |00:00:00.01 | 6 | | | | | 4 | TABLE ACCESS FULL| T2 | 1 | 800 | 800 |00:00:00.01 | 6 | | | | ----------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("T1"."J1"="T2"."J2")

Notice that the * E-rows* value is different. The join cardinality algorithm seems to have changed in the upgrade from 11.2.0.4 to 12c. I haven’t quite figured out how to get to the 11g result, but I seem to get quite close

**most of the time**by making a simple change to the final query that I used to predict the optimizer’s estimate. In the

*expression that chooses between the actual*

**case***frequency and the*

**t1.j1***“average frequency”*don’t choose, just use the latter:

case when f2.frequency > 1 and f1.frequency is not null -- then f1.frequency -- 12c then &m_avg_t1_freq -- 11g when f2.frequency > 1 and f1.frequency is null then &m_avg_t1_freq end *

As I modified the * t1* row with the value 25 to hold other values this change kept producing results that were exactly 2, 2.5, or 3.0 different from the execution plan

*– except in one case where the error was exactly 15.5 (which looks suspiciously like 17.5: the*

**E-Rows***“average frequency in t2”*minus 2). I’m not keen to spend time trying to work out exactly what’s going on but the takeaway from this change is that anyone upgrading from 11g to 12c may find that some of their queries change plans because they happen to match the type of example I’ve been working with in this post.

In some email I exchanged with Chinar Aliyev, he suggested three fix-controls that might be relevant. I’ve added these to * an earlier posting* I did when I first hit the anomaly a few days ago but I’ll repeat them here. I will be testing their effects at some point in the not too distant future:

14033181 1 QKSFM_CARDINALITY_14033181 correct ndv for non-popular values in join cardinality comp. (12.1.0.1) 19230097 1 QKSFM_CARDINALITY_19230097 correct join card when popular value compared to non popular (12.2.0.1) 22159570 1 QKSFM_CARDINALITY_22159570 correct non-popular region cardinality for hybrid histogram (12.2.0.1)

[…] It looks as if the difference comes mostly from a coding error in 11g that has been fixed in 12c – I couldn’t find an official bug or fix_control that matched, though. More on that later on this week. […]

Pingback by Upgrades – again | Oracle Scratchpad — August 5, 2020 @ 1:45 pm BST Aug 5,2020 |

[…] Frequency / Height-balanced (obsolete) with filter predicates […]

Pingback by Histogram catalogue | Oracle Scratchpad — November 8, 2022 @ 1:07 pm GMT Nov 8,2022 |

Jonathan,

I realize this blog is several years old but I will attempt a question.

Thank you for the multiple blogs post about the join cardinality and histograms. I’ve read the five “join cardinality” series and have been testing implementations on our systems where potentially needed.

Oracle SE v19.18

While running tests of adding histograms, gathering stats, running test queries, and examining execution plans I noticed what seemed like an anomaly with cardinality estimate in subsequent runs of gathering stats and executing the same query. At first I thought I was imagining it but I went back and explicitly ran the sequence 100 times, i.e. gather stats, check for histogram, run query, get execution plan.

Even before explicitly placing a histogram on the column I can see the stats on this one column fluctuate. (The column has over 6000 NDV.) The stats run can even vary such that some runs get a hybrid histogram with 254 buckets while others do not. Some plans with a hybrid histogram of 254 buckets have a cardinality estimate of 23 (same as without a histogram) while others have estimates of 400 or 500 or more. (The actual for this column and bind variable is 321.) This made me think that the column was doing a block sampling to get such variance.

Reading Nigel’s blog confirmed that speculation.

https://blogs.oracle.com/optimizer/post/how-does-auto-sample-size-work-in-oracle-database-12c

After Oracle 12 the gather table stats seemingly does a single pass over the entire table reading all blocks and uses a hashing technique to get the table stats and basic column stats. This is extended to frequency and top-frequency histograms. However, for hybrid histograms it does a sampling.

Seemingly that sampling can cause the stats to vary.

Since my NDV for this column is 6000+ I attempted to create a histogram of 2048.

I continue to get a hybrid histogram but because of the sampling the estimate can vary from 70 to 682 in my tests. Again, the actual is 321.

This doesn’t seem too bad but there is a bit of the butterfly effect when this table is far down in the join order and the errors propagate up to other joins. The underestimates in particular can lead to NL joins where none of that should be happening.

My next step was to change the ESTIMATE_PERCENT to 100 to eliminate the sampling issue.

This has the effect of creating a HEIGHT BALANCED histogram.

The cardinality estimate becomes 278 (actual 321) which is good. But even better is that it is stable. We get that estimate for 100/100 runs.

But now the question is, should we do that?

In part 2 of the series you state, “… but we’re going to start with an example where our data won’t allow Oracle to create a frequency histogram and a height-balanced histogram can become more of a liability than a benefit; and we’ll describe a strategy (that you will no longer need in 12c) for working around this limitation.”

It seems the recommendation is to use the default AUTO SAMPLE size. But the auto sample size with the hybrid histogram is a bit unstable if the data has even a bit of skew. In this particular scenario the column data is not always seen as skewed since it does not always get a histogram. Yet there can be significant variability.

To make it worse (…or more interesting), this is just one column on one table of several tables and columns that are routinely joined together. Fluctuating stats in any of the tables could cause unstable and inconsistent performance.

I can provide stats of the runs I am referring to but I did not see a mechanism for formatting and did not want to dump a bunch of unformatted stats here.

I am leaning toward the ESTIMATE_PERCENT=100 for specific cases that can be identified where it has an impact. But I do not want to disregard the warnings and recommendations of not straying from the default AUTO SAMPLE size and removing access to the newer statistics objects and modelling which I hope only gets better.

Interested in your opinion.

Thank you

Andrew

Comment by Andrew Markiewicz — March 17, 2023 @ 8:23 pm GMT Mar 17,2023 |

Andrew,

Thanks for the comment. You mentioned “part 2 of the series”, I’ll just point out that this refers to a short series I wrote for redgate about histograms – listed near the end of my histogram catalogue: https://jonathanlewis.wordpress.com/2018/10/15/faking-histograms/

The conclusion to that article starts with a comment that is still true 10 years later:

“Sometimes you have to do some fairly subtle things to get the best out of Oracle, and when you’re dealing with histograms you’re in an area where you’re most likely to need some custom mechanisms to get things right – but probably only for a few special cases.”If generating a height-balanced histogram at 100% is what it takes for your particularly data set then that’s what you have to do. In fact I think there’s an article on the blog where I’ve reported doing exactly that on a client site – though I’ve also said something about locking the stats so that the histogram doesn’t get deleted or changed by accident, and documenting the need for special processing when new stats are needed.

An alternative, which I’ve also suggested in a few places in the blog, is to write code to fake a histogram that is a good description of the data in that column so that you don’t have to waste time gathering at 100% for that one column.

At 6,000+ distinct values, though, I’m wondering whether you even need a histogram or whether you’re using one because Oracle has automatically decided that the column might benefit from one.

Regards

Jonathan Lewis

Comment by Jonathan Lewis — March 22, 2023 @ 11:19 am GMT Mar 22,2023 |

Jonathan,

Thanks for the feedback.

“…and documenting the need for special processing when new stats are needed.”

This is the line of thinking I am planning to propose. Currently I have set one table preference for one histogram on one system and then had a sinking feeling this is not the route to take. Too easy to lose the history of “who, what, where, when, why” these changes were made if they are individual DBA actions.

Might be better to have the application use a utility system to administer and document these statistics preferences. Additionally, we have several customer installations of the software with different data profiles in the system (i.e. some have lots of data in one place while others have very little; depends on the needs of the customer). That makes me concerned that one size may not fit all and different settings may need to be on option as necessary. We’ll start with setting the same preferences for all instances and tweak where necessary.

I’ve read the blog on faking a histogram. We may incorporate that into the statistics utility as an option.

Regarding a histogram on the 6000+ NDV column. When gathering stats, Oracle occasionally recognizes that it might need a histogram on that column. I’ve run tests of gathering stats 100 times in a row and maybe 5-10% of the time it would create a histogram with 254 buckets. And then, since it was a hybrid, the sampling can get unpredictable results. Mostly it would just get the same stats and underestimate by 10-15x (e.g. 29 vs actual 321 with an equality predicate). For this column, that gets compounded by many joins and later makes the estimate off by several OOM. I more discovered this one in the investigation of adding histograms for the purposes of better join cardinalities. But it could also be used in an equality predicate with a bind, so it is also helpful for that.

Regardless, I do get better and consistent estimates with the height-balanced histogram of 2048 buckets and ESTIMATE_PERCENT=100. So far, the stats gathering has not been that bad using 100% for this table.

Thanks for everything.

Andrew

Comment by Andrew Markiewicz — March 24, 2023 @ 2:55 pm GMT Mar 24,2023 |