In an earlier article about investigating the state of an index in detail I supplied a piece of SQL that would analyse an index (no, not using the Analyze command) and summarise the number of entries in each leaf block that currently held any entries at all. Here’s a sample of the type of output it produced:
ROWS_PER_BLOCK     BLOCKS TOTAL_ROWS RUNNING_TOTAL
-------------- ---------- ---------- -------------
             1       8706       8706          8706
             2       4830       9660         13536
             3       2953       8859         16489
             4       1648       6592         18137
             5        997       4985         19134
             6        628       3768         19762
             7        300       2100         20062
             8        162       1296         20224
             9         87        783         20311
            10         52        520         20363
...
           227        100      22700         31871
           228        111      25308         31982
           229         84      19236         32066
           230         97      22310         32163
           231         77      17787         32240
           232         61      14152         32301
           233         54      12582         32355
           234        529     123786         32884
               ---------- ----------
sum                 32884    1887595
This is useful of course as it tells you a lot about the way that space is used in the index – but it only tells you how many blocks are in any given state, it doesn’t tell you whereabouts in the index those blocks are. It would be really nice if you could “draw a picture” of the index, showing where the gaps were and where it was densely packed.
I was thinking about this a few weeks ago and suddenly realised that it was a simple problem. Oracle gives you exactly the tools you need. I was travelling back from a client site to London and the train ride was only an hour long so it gave me just enough time to generate a quick test to demonstrate the principle. ***
We start with an article on the treedump command, which I published a little while ago. The treedump writes a trace file describing your index, one line per block of the index, in correct index order.
----- begin tree dump
branch: 0x40035a 4195162 (0: nrow: 3, level: 2)
   branch: 0x40e2c0 4252352 (-1: nrow: 312, level: 1)
      leaf: 0x40035b 4195163 (-1: nrow: 240 rrow: 240)
      leaf: 0x40215f 4202847 (0: nrow: 218 rrow: 218)
      leaf: 0x40035c 4195164 (1: nrow: 444 rrow: 444)
      leaf: 0x40035d 4195165 (2: nrow: 444 rrow: 444)
... etc ...
This is the key to the problem: all you have to do is read the trace file and you’ve got everything you need to know – except that it’s one line per leaf block, which is a lot of reading for a large index, and that makes it hard to spot interesting patterns.
But it’s a trace file – and we can read trace files from SQL by making use of external tables; [update: but see this note for a more modern option for reading trace files] and we can collapse the data down to a more reasonable size by using analytic functions like the ntile() function. So here’s a little script that puts a few simple pieces together and comes up with a fairly useful tool.
rem
rem     Script:         read_treedump.sql
rem     Dated:          Feb 2010
rem     Author:         Jonathan Lewis
rem     Purpose:        Using external files to analyze an index.
rem
rem     Last tested:
rem             10.2.0.3
rem     Not tested
rem             11.2.0.1
rem             11.1.0.7
rem             9.2.0.8
rem     Not relevant
rem             8.1.7.4 -- no external tables
rem
rem     Notes:
rem     See read_trace.sql for notes on reading trace files.
rem
rem     Required Privileges:
rem     grant create any directory to the user.
rem

connect test_user/test
start setenv
set timing off

execute dbms_random.seed(0)

set echo on
drop table t1;
set echo off

begin
        begin           execute immediate 'purge recyclebin';
        exception       when others then null;
        end;

        begin
                dbms_stats.set_system_stats('MBRC',8);
                dbms_stats.set_system_stats('MREADTIM',26);
                dbms_stats.set_system_stats('SREADTIM',12);
                dbms_stats.set_system_stats('CPUSPEED',800);
        exception
                when others then null;
        end;

        begin           execute immediate 'begin dbms_stats.delete_system_stats; end;';
        exception       when others then null;
        end;

        begin           execute immediate 'alter session set "_optimizer_cost_model"=io';
        exception       when others then null;
        end;
end;
/

-- ------------------------
-- Create a table and index
-- ------------------------

define m_size=100000

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum  id
        from    dual
        connect by
                rownum <= 10000
)
select
        rownum                  id,
        lpad(rownum,10,'0')     vc_id
from
        generator       v1,
        generator       v2
where
        rownum <= &m_size
;

create index t1_i1 on t1(vc_id) nologging;

-- ---------------------------------------------------------
-- Create a "FIFO" pattern of space in the index leaf blocks
-- ---------------------------------------------------------

delete from t1
where
        id between 1 and 0.5 * &m_size
and     mod(id,100) != 0
;

commit;

delete from t1
where
        id between 0.5 * &m_size and 0.75 * &m_size
and     mod(id,50) != 0
;

commit;

delete from t1
where
        id between 0.75 * &m_size and 0.875 * &m_size
and     mod(id,25) != 0
;

commit;

delete from t1
where
        id between 0.875 * &m_size and 0.9375 * &m_size
and     mod(id,12) != 0
;

commit;

delete from t1
where
        id between 0.9375 * &m_size and 0.96875 * &m_size
and     mod(id,6) != 0
;

commit;

-- -------------------
-- Dump the index tree
-- -------------------

column object_id new_value m_object_id

select  object_id
from    user_objects
where   object_name = 'T1_I1'
;

alter session set events 'immediate trace name treedump level &m_object_id';

-- -------------------------------
-- Now sort out the trace filename
-- It's easier in 11g
-- -------------------------------

column spid new_value m_spid

select
        trim(pro.spid)  spid
from
        v$process       pro,
        v$session       ses
where
        pro.addr = ses.paddr
and     ses.sid = (
                select  sid
                from    v$mystat
                where   rownum = 1
        )
;

define m_filename = 'd10g_ora_&m_spid..trc'

prompt Trace file name: &m_filename

-- ------------------------------------------
-- Set up an external table to read the trace
-- ------------------------------------------

define m_file_dir = 'c:\oracle\admin\d10g\udump'
define m_log_dir  = 'c:\working'

create or replace directory ext_tab as '&m_file_dir';
create or replace directory log_dir as '&m_log_dir';

drop table ext;

create table ext(text_line varchar(4000))
organization external
(
        type oracle_loader
        default directory ext_tab
        access parameters (
                records delimited by newline
                characterset us7ascii
                discardfile     log_dir:'read_trace_dis.txt'
                badfile         log_dir:'read_trace_bad.txt'
                logfile         log_dir:'read_trace_log.txt'
                fields
                        rtrim
                        reject rows with all null fields
                (
                        text_line       (1:4000)        char
                )
        )
        location ('&m_filename')
)       reject limit unlimited
;

-- -----------------------------------------------------
--
-- Read the treedump in order, selecting only lines with
-- "leaf" in them.
--      branch: 0x140020a 20972042 (0: nrow: 307, level: 1)
--         leaf: 0x140020b 20972043 (-1: nrow: 326 rrow: 3)
--         leaf: 0x140020c 20972044 (0: nrow: 326 rrow: 3)
--
-- Use instr() to get the text following the last ": ",
-- and select rownum to get an ordering on the lines.
--
-- Use substr() and length() to trim off the final ")"
-- then turn the result into a number and select ntile()
-- to collect the numbers into ordered batches.
--
-- Finally sum the batches - I've used 50 tiles because
-- that's a page length of numbers, you could use 100 or
-- more if you selected the result into a spreadsheet to
-- graph the results.
--
-- -----------------------------------------------------

spool read_treedump

set pagesize 60

select
        section, sum(leaf_row_count) row_count
from
        (
        select
                ntile(50) over (order by rn)    section,
                to_number(
                        substr(
                                text_line,
                                1,
                                length(text_line)-1
                        )
                )                               leaf_row_count
        from
                (
                select
                        rownum  rn,
                        substr(
                                text_line,
                                instr(text_line,':',-1) + 2
                        )       text_line
                from
                        ext
                where
                        instr(text_line,'leaf') != 0
                )
        )
group by
        section
order by
        section
;

spool off
If you feel like making more of it, you could probably turn it into a pipe-lined function taking the index schema and name as an input, together with the number of tiles, and returning the data set. You’d need to customise the code to identify the trace file location, and probably have a fixed setup for directories rather than creating them on the fly in the function. Once you’ve done this you could then set up a spreadsheet with a ‘select from table_function()’ call and pull the data straight into a chart.
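Here’s a minimal sketch of what such a function might look like – all the names (treedump_section_t, treedump_tab_t, index_picture) are hypothetical, the trace-file and directory plumbing is left out, and it assumes the external table ext from the script above has already been pointed at the right trace file:

create or replace type treedump_section_t as object (section number, row_count number);
/

create or replace type treedump_tab_t as table of treedump_section_t;
/

create or replace function index_picture(
        i_owner         in varchar2,
        i_index_name    in varchar2,
        i_tiles         in number default 50
)
return treedump_tab_t pipelined
as
        m_object_id     number;
begin
        select  object_id
        into    m_object_id
        from    all_objects
        where   owner       = i_owner
        and     object_name = i_index_name
        and     object_type = 'INDEX';

        -- dump the index tree for the chosen index
        execute immediate
                'alter session set events ' ||
                '''immediate trace name treedump level ' || m_object_id || '''';

        -- assumes the external table ext is already reading the right trace file
        for r in (
                select  section, sum(leaf_row_count) row_count
                from    (
                        select
                                ntile(i_tiles) over (order by rn)       section,
                                to_number(
                                        substr(text_line, 1, length(text_line) - 1)
                                )                                       leaf_row_count
                        from    (
                                select
                                        rownum  rn,
                                        substr(
                                                text_line,
                                                instr(text_line, ':', -1) + 2
                                        )       text_line
                                from    ext
                                where   instr(text_line, 'leaf') != 0
                                )
                        )
                group by section
                order by section
        ) loop
                pipe row (treedump_section_t(r.section, r.row_count));
        end loop;

        return;
end;
/

-- e.g. to feed a chart from a spreadsheet:
-- select * from table(index_picture('TEST_USER', 'T1_I1', 100));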
Here’s the output I got from this demonstration case. As you can see, it shows that the index starts with a great long tail that is thinly populated and that a large fraction of the data is packed into the last 10% of the index, with most of that being in the right-hand 4% of the index.
   SECTION  ROW_COUNT
---------- ----------
         1         22
         2         23
         3         23
         4         23
         5         23
         6         22
         7         23
         8         20
         9         19
        10         20
        11         19
        12         20
        13         20
        14         19
        15         20
        16         19
        17         20
        18         19
        19         20
        20         20
        21         19
        22         20
        23         19
        24         20
        25         31
        26         39
        27         39
        28         40
        29         39
        30         39
        31         39
        32         39
        33         39
        34         39
        35         39
        36         39
        37         40
        38         71
        39         78
        40         78
        41         79
        42         78
        43         78
        44        115
        45        163
        46        163
        47        201
        48        326
        49       1368
        50       1874
           ----------
sum              5665
Reminder: a treedump walks the index in order, reading one block at a time – this can be a lot of work and take a lot of time. There are also versions of Oracle where the treedump includes a full symbolic block dump for every leaf block in the index, so test it on a very small index before trying to dump a large one.
*** Footnote:
Like many of the scripts I’ve published on this blog, the SQL shown here is “good enough” for me to use because I know how it works and I won’t have to use it often. It’s not intended to be a bullet-proof program in a “productised” state.
Update (Dec 2010)
(See comment 7 below) each line of the treedump is about 75 bytes long – so the whole trace file will be in the region of (75 * leaf_blocks / 1048576)MB in size. Make sure that you can dump a trace file that size before you start. Check the parameter max_dump_file_size.
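As a quick sanity check you could run something like the following for the index you intend to dump (the index name here is just the one from the demonstration above, and leaf_blocks is only as accurate as your most recent stats collection):

select
        index_name,
        leaf_blocks,
        round(75 * leaf_blocks / 1048576)       estimated_trace_mb
from    user_indexes
where   index_name = 'T1_I1'
;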
Wonderful and useful.
Comment by fli — March 9, 2010 @ 5:22 pm GMT Mar 9,2010 |
[…] Leaf Block scanner: a labour-intensive analysis of leaf block usage for a simple B-tree index. Drawing an Index: An example of reading a treedump and using an analytic function to “draw a picture” of […]
Pingback by Simple scripts « Oracle Scratchpad — March 10, 2010 @ 10:29 pm GMT Mar 10,2010 |
[…] Drawing a picture of the index […]
Pingback by Index analysis « Oracle Scratchpad — March 10, 2010 @ 10:44 pm GMT Mar 10,2010 |
[…] 4-How to automate to read index treedump via external table ? Jonathan Lewis-treedump-2 […]
Pingback by Blogroll Report 05/02/2010 – 12/03/2010 « Coskan’s Approach to Oracle — April 23, 2010 @ 1:07 am BST Apr 23,2010 |
[…] Treedump – 2 https://jonathanlewis.wordpress.com/2010/03/07/treedump-2/ […]
Pingback by reading this week | Sidney's blog — June 18, 2010 @ 11:41 am BST Jun 18,2010 |
[…] Lewis @ 6:45 pm UTC Jul 1,2010 Some time ago I published a little script that showed you how to read an index treedump and summarise the leaf block usage in the logical order of the index leaf blocks – allowing […]
Pingback by Index branches « Oracle Scratchpad — July 1, 2010 @ 6:46 pm BST Jul 1,2010 |
Hi Jonathan,
I am trying to take a treedump of one of my non-partitioned PK indexes, which is sized at 289 GB. It’s a 3-column composite index with compression enabled. I read the bottom part of your post saying that a treedump will take a lot of time. It has been running for the last 1 hr (the index has dba_indexes.leaf_blocks = 37226645) and I am expecting more time. While I am fine with a long treedump time, I don’t understand why the tracefile is not getting updated. For the first 5 min it got to 20MB and for the last 55 min it has stayed at the same size.

However, I am observing ‘db file sequential read’ for the session which is taking the treedump, and I believe this is due to scanning all the leaf blocks one by one, as you have mentioned in the blog. My table holds financial records for the last 5 years and it is never purged. There is no ETL, but around 20-30 concurrent sessions are inserting rows (conventional path insert) all the time (except weekends) through services. Last time I did an index rebuild, I gained a lot of space. This time, I want to analyze the index structure to find the whereabouts of fragmented entries within the index structure. The index ini_trans is 2 and PCTFREE was set to 0 when it was created and when it was rebuilt last time.

I will update the status of my treedump test as soon as I have the tracefile completed.
Comment by Bhavik Desai — December 5, 2010 @ 8:59 am GMT Dec 5,2010 |
Bhavik,
It looks as if you have a max_dump_file_size set to the equivalent of 20MB. You don’t say which version of Oracle you are using, but the default in 10g is “unlimited” – so perhaps it has been deliberately restricted on your system.
Each line of the trace will be about 75 bytes, and you have about 37M blocks, so you will get a trace file of about 2.5GB – 3.0 GB from the call. And, as you have observed, about 37M single block read requests (although you may find that sometimes you get “db file parallel read” waits).
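(Roughly: 75 * 37,226,645 is a little under 2.8 billion bytes, i.e. about 2.6GB.)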
If your trace file size is limited, you could change it, before you start the dump, with:
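For example, at the session level:

alter session set max_dump_file_size = unlimited;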
I have added a comment to both articles highlighting this point.
Have you seen my note and script on estimating the “correct” size of an index ?
Comment by Jonathan Lewis — December 5, 2010 @ 12:12 pm GMT Dec 5,2010 |
[…] Dec 2010: (See this comment on a related post): Each line of a treedump takes about 75 bytes, so you need to be able to generate a tracefile of […]
Pingback by treedump « Oracle Scratchpad — December 5, 2010 @ 12:27 pm GMT Dec 5,2010 |
Thanks Jonathan,
As always, you are right in identifying the problem. I forgot to set the tracefile size to unlimited. I did that and restarted the treedump process again 555 min ago and it’s still going on. Thanks for providing the estimated trace file size. I have 45GB space available on my dump destination. I hope the treedump will not lock anything and is just doing sequential or parallel index block reads.

Thought of giving you the first few and the last few (from the trace so far) lines of the trace file…

My intention here is to identify how much space I am going to gain after doing an index online rebuild, and to understand the index structure. (I am open to considering the COALESCE or SHRINK commands, but the intention of the index re-org is to gain storage.)
(This is not an IOT index and table is a non-partitioned heap table).
Index columns are :
EXTERNAL_REFERENCE_ID_TYPE, ACCTG_EVENT_TYPE_ID, EXTERNAL_REFERENCE_ID with data-types of VARCHAR2(50), NUMBER(4), VARCHAR2(256) respectively.
EXTERNAL_REFERENCE_ID column is being populated by an Oracle sequence. (INCREMENT_BY 10, cache 1000,last_number=89297318104)
Comment by Bhavik Desai — December 5, 2010 @ 8:57 pm GMT Dec 5,2010 |
Bhavik,
I don’t have time for a full analysis – but here are a few brief comments.
The number of rows per leaf block at the start is consistent with the average column lengths of the index. Later on it is half the size – suggesting that you’ve hit the “ITL explosion” problem ( https://jonathanlewis.wordpress.com/2009/09/19/index-itls/ ). The fact that the leaf blocks oscillate between 50 and 100 rows per block also suggests that rows have been appearing out of sequence in a way that does not allow for “back-fill”, making your problem worse.
Your sequence cache size looks too small, and you seem to be converting (by padding with leading zeros, perhaps) a 10 digit number into a 36 character string – which probably accounts for a huge fraction of your index size. It’s probably not feasible to change this – but it looks as if it needs changing.
Given the very small number of distinct value for the first two columns in this index I think you should probably aim to rebuild it with compress 2 to reduce repetitions of the leading two columns in each leaf block.
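(For reference, the statement is the obvious one – the index name here is invented, and options such as online, parallel, or nologging are separate decisions for your environment:)

alter index big_table_pk rebuild compress 2;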
You might want to consider hash partitioning this index at least a few ways – eight or 16 , perhaps – given the impact of concurrency on your code. Alternatively it may be a nice idea to recreate the table as list partitioned on external_reference_id_type. Obviously there are lots of things about your application that I can’t know which might make both these ideas totally inappropriate.
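(For illustration only – the table and index names here are invented, and the choice of partitioning key is just one possibility; bear in mind that a global partitioned index has to be prefixed on its partitioning columns, so hashing on just EXTERNAL_REFERENCE_ID would mean rethinking the column order:)

create index big_table_pk_ix on big_table(
        external_reference_id_type,
        acctg_event_type_id,
        external_reference_id
)
global partition by hash (external_reference_id_type, acctg_event_type_id)
partitions 16
compress 2
nologging
;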
There is a recent bug fix for the ITL problem, by the way, see: https://jonathanlewis.wordpress.com/2009/07/28/index-explosion/#comment-37834
Comment by Jonathan Lewis — December 5, 2010 @ 9:28 pm GMT Dec 5,2010 |
[…] treedump 2: Another look at tree dumps. […]
Pingback by Index ITLs « Oracle Scratchpad — December 5, 2010 @ 9:31 pm GMT Dec 5,2010 |
Thanks Jonathan for your time and valuable inputs.
I forgot to mention (and you asked earlier) that my database is running on 11.1.0.7.0 – 64bit Production (RHEL 5.3).
I will definitely look at the suggestions you have given w.r.t. partitioning options, increasing the sequence cache, padding issues with the sequence#, and compression on the first 2 columns.

My treedump trace is complete and here is the output of the sql that you listed:

I have also run index_efficiency_3.sql

May I ask (definitely respecting your time and bandwidth) for further analysis on what else (apart from the ITL issue) has caused such large fragmentation?
Comment by Bhavik Desai — December 6, 2010 @ 1:48 pm GMT Dec 6,2010 |
Thanks Jonathan…
I have confirmed that it is an ITL explosion case. I dumped the last leaf block and found that it has ITC=127. Interestingly, I dumped some random leaf blocks which were listed higher up in the treedump (5k lines above the last leaf block) and found that ITC is 148.
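(For reference, the decimal block address printed on a treedump line can be converted to a file and block number and then dumped – the address used here is just the one from the example treedump in the article above:)

select
        dbms_utility.data_block_address_file(4195163)   file#,
        dbms_utility.data_block_address_block(4195163)  block#
from    dual
;

-- substitute the file# and block# returned by the query above
alter system dump datafile 1 block 859;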
Giving you leaf block dump :
I am a bit sceptical about the space utilization and am guessing that there is something else that I am not considering while calculating the leaf block size.
TOTAL_SIZE_IN_LEAF_BLOCK = block overhead + (INI_TRANS * 24) + max usable space Z, where Z = used space X + available space Y

Z = kdxlebksz (4528) --> max usable space

Y = kdxcoavs (2254) --> available space

Therefore used space X = (Z - Y) = 2274 (2274 is the space occupied by all the rowids in the leaf block)

Block overhead for a B-tree index = 9 bytes

So TOTAL_SIZE_IN_LEAF_BLOCK = (9 + 3352 + 4528) = 7889, which is 303 bytes less than 8192… where did the 303 bytes go?
My PK index is a 3-column compressed index and compression is set for the first 2 columns. Since it is a composite index, I am sure that 50-50 leaf block splits have also occurred despite the fact that the last column is sequence-based.

Here is the order of index population:

1) Table is created without any data

2) PK index is created on that table with PCTFREE=0, INI_TRANS=2 and compression set on the first 2 columns

3) Populated 90M rows from an existing table (INSERT APPEND PARALLEL)

4) Started the application service to load new data into the table and thus into the index (conventional path INSERT, ~30-50 sessions inserting at a time)

The application (service) has observed ‘low commit time’ a couple of times in the past, which suggests that the number of uncommitted transactions might have gone up during that time and thus the number of ITL entries increased up to 148… (This is just my guess from the application low commit time issue.)
Comment by Bhavik Desai — December 8, 2010 @ 8:54 am GMT Dec 8,2010 |
Actually, let me take that back… I was counting the overhead only one time (9 bytes), but it is actually 9 * kdxconro (the number of row entries in that leaf block) = 9 * 100 = 900.

So TOTAL_SIZE_IN_LEAF_BLOCK = (900 + 3352 + 4528) = 8780… this is now more than 8192!

I don’t know where I am making the mistake!
Comment by Bhavik Desai — December 8, 2010 @ 10:49 am GMT Dec 8,2010 |
Hi Jonathan,
I tried creating a HASH partitioned IOT on the last column (128 hash partitions). But partition movement can cause a significant outage to the application service which is loading records into the IOT (conventional path insert). Inserts are happening 24*7, and whenever I do “alter table iot move partition x;” it will lock that particular partition; hence I would either get an error (ORA-00054: resource busy) during the partition movement, or my insert will wait until the partition it is trying to insert into has been moved.
Comment by Bhavik Desai — December 14, 2010 @ 2:03 pm GMT Dec 14,2010 |
Bhavik,
I have had a look at your results – and will make a few comments as soon as I have a little free time. In the meantime – is the index compressed on the first two columns, and how much variation appears in the lengths of the values used in the three columns?
Comment by Jonathan Lewis — December 8, 2010 @ 9:35 pm GMT Dec 8,2010 |
Thank you very much Jonathan…
Yes… the index is indeed compressed on the first two columns and I believe the compression ratio is very good.

If you look at the leaf block dump that I provided, you will find that all 100 rowids in that block are sharing the same PREFIX row (prc 100). I believe your question to me is to find the variation in the length of the three columns in DIFFERENT leaf block dumps… am I right? If yes, then here it is:
Let me know what else i can provide for better analysis ?
Many thanks in advance…
Comment by Bhavik Desai — December 9, 2010 @ 6:45 am GMT Dec 9,2010 |
Bhavik,
I’m running at least 24 hours behind your comments – I hadn’t even got as far as reading your blockdump when I wrote my last comment, so I didn’t notice you’d answered my question long before I asked it.
My initial comments about compression and the left-padding of the sequence number were based on the indications of ca. 200 rows per block. Your treedump showing more than 400 rows per leaf block suggested that the left-padding was wrong, or you were compressing, or both.

I’ve just spent a couple of minutes looking at the block dump (and your description of the initial load), which probably answers all the questions I might want to ask about how you’ve implemented this index. I don’t think there’s anything surprising about what you’re seeing – and I may find time to write up some comments this evening (or over the weekend).
I might extract some of the material and turn it into a blog item rather than leaving a chain of comments.
Comment by Jonathan Lewis — December 9, 2010 @ 8:03 am GMT Dec 9,2010 |
Thank you Sir…

I really get excited when I see your reply, whether directly on your blog or on OTN! In the past we talked in person when you came to India (Hyderabad) a couple of months back, about FREELIST-32 for an MSSM tablespace to avoid BBW and the NLNVL function to handle NULL values. Your suggestions were awesome and they brought a lot of performance enhancement to my critical DW.
Comment by Bhavik Desai — December 9, 2010 @ 8:16 am GMT Dec 9,2010 |
Bhavik,
A few summary comments on the information supplied.
Point 1:
Your observation that this index will not be doing 90/10 splits even though it involves a sequence is correct. The sequence is not the leading column – and that is a pre-requisite for 90/10 splits.
Given the number of rows – 4.48 Billion – and number of distinct values for the leading pair of columns (EXTERNAL_REFERENCE_ID_TYPE 36 values, ACCTG_EVENT_TYPE_ID 124 values) there could be 4,464 leading combinations with about 1,000,000 rows each. In this type of scenario the index will, in effect, consist of 4,463 sections doing 50/50 splits with only the top section able to do 90/10 splits. In effect it will behave roughly like 4,464 separate indexes jammed together (and, in passing, the compression effects will be extremely good – the one prefix per block you’ve seen.)
Point 2:
There are some contradictions in the information you’ve supplied. The average column length of EXTERNAL_REFERENCE_ID is 36 – but you’ve pointed out that this is a sequence number (increment 10, cache 1000, current high value at 11 digits) that has been stored as a varchar2. In the block dump you supplied we can see that the value is stored as a simple “to_char” of a number, taking 11 bytes – if this is always the case, how does the average column length get to be 36? (In passing, the values in the block you’ve dumped increase by 1,000 from row to row – that’s a bit of a coincidence I’d want to investigate.)
In one of your other listings you show the “row #0” lengths at 6, 19, 20, and 36 bytes – the first might be (must be) deleted rows, the middle pair show rows that look appropriate for storing 10 and 11 digit numbers as character strings, the highest would require a 27 character string. So (a) you seem to have been deleting or updating key values and (b) you seem to have used (or be using) more than one strategy for storing the EXTERNAL_REFERENCE_ID. (And none of your examples show the row#0 length of 44 that would be consistent with an average column length of 36).
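A quick way to see what is really stored (the table name is a stand-in, and the sample clause is only there because of the size of the table) would be something like:

select
        length(external_reference_id)   len,
        count(*)                        cnt
from    big_table sample (0.1)
group by
        length(external_reference_id)
order by
        len
;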
Point 3:
You loaded 90 Million rows into the table/index initially with pctfree 0, and you mention you have rebuilt it in the past. On the load, and on the rebuild, you will have packed the index to 100%, with no spare ITL entries. As soon as the concurrent processes started to run they would cause the top blocks of each section of the index to do 50/50 block splits (apart from the very top block which would do 90/10).
Assuming some combinations of (EXTERNAL_REFERENCE_ID_TYPE, ACCTG_EVENT_TYPE_ID) are more popular than others, some sections of the index will see a higher level of concurrency than the others – so they will be more likely to accumulate extra ITL entries on leaf node splits, and to do so more rapidly. (This explains why you can get “randomly” varying results for ITC values across the index – it’s behaving like 4,464 separate indexes.)
Rows with the longer values for the row#0 (ca. 40) figures will pack a block up to about 200 entries. Rows with the shorter values (ca. 20) will pack a block up to about 400 entries. Blocks which have seen deletes (row#0 = 6) may show enough variation to account for the blocks you reported with nearly 500 rows – especially if this was a very small fraction of the blocks.
So we should not be surprised to see: blocks with 200 or 400 rows (the initial load / rebuild effect), blocks with 100 or 200 rows (50/50 leaf node split), and blocks with 50 to 100 rows (leaf node split plus worst case ITL bug).
Obviously, since I don’t know what you are really doing with the EXTERNAL_REFERENCE_ID, there may be lots of noise surrounding this pattern. Similarly, since the ITL bug introduces extra ITL entries over time at a rate that is loosely dependent on “local” concurrency and a bit of bad luck, there may be further noise surrounding the figures.
Point 4:
Possible action points:
a) Find out what is really going on with the EXTERNAL_REFERENCE_ID and see if you can sanitise it.
b) Work on getting the patch from Oracle for the ITL bug
c) If space is really significant, rebuild the index.
d) If space is not at a premium, rebuild the index ONLY IF you can see that the physical I/O on the index is a significant contributor to any performance problems (See “Segments by Physical Reads” in the AWR as a guide.) Note – if all your queries are about ‘recent’ EXTERNAL_REFERENCE_IDs then your system may work perfectly well if it can cache the last few blocks from each of the 4,464 index sections to avoid physical I/O on the index; so even if the whole index is wasting space, the way you are using the index may mean it is not causing an I/O problem.
e) To minimise hot spots and to make future rebuilds easier (if you think it’s likely to be necessary), and if your important queries always include predicates like ‘EXTERNAL_REFERENCE_ID = {constant}’ rebuild the index as a hash-partitioned index on EXTERNAL_REFERENCE_ID – perhaps as many as 128 partitions. If you’re on 11g you might look at list/hash composites (if that’s legal, I’d have to check it myself) – listing on external reference types, hashing on EXTERNAL_REFERENCE_ID. You might even consider rebuilding the table as a partitioned table with local indexing.
f) Create (by writing a script – see my blogs on frequency histograms) a histogram on the first two columns of the index: you may find the optimizer working better if it knows fairly precisely the distribution of data for those values. If you’re on 11g and the number of combinations is a long way off the 4,464 I’ve been talking about then create some extended stats on the combination to tell Oracle the number of distinct combinations – it may help in some cases (again, write a program after an initial check; there’s a sketch of the extended stats step after this list). Note that your “num_buckets” on the two columns was less than the num_distinct – this means that you have some values which don’t appear in the histogram, which may make the optimizer produce some silly cardinality estimates.
g) Any time you do rebuild the index, keep a log of WHEN you took it, and how big it was after the rebuild: any future estimates of how inefficient the index might be (before you do the expensive checks) needs to know what baseline you started from.
h) I note that the block dump you supplied showed the EXTERNAL_REFERENCE_ID_TYPE as “TransferReceiveActionID”. Because of the index compression this is not wasting storage in the index – but if you really are pushed for space you could consider storing this column as a code number in the table – and introduce lists of constraining values, views, and a function-based index, to recreate access paths that use the text content.
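As a starting point for point (f), the 11g extended stats piece might look like the following sketch. The table name is invented, and this is only the mechanical column-group part – the hand-built frequency histogram mentioned above is a separate exercise:

declare
        l_extension     varchar2(100);
begin
        -- register the column group so that Oracle learns the number
        -- of distinct combinations at the next stats collection
        l_extension := dbms_stats.create_extended_stats(
                ownname         => user,
                tabname         => 'BIG_TABLE',
                extension       => '(external_reference_id_type, acctg_event_type_id)'
        );
        dbms_output.put_line('Extension created: ' || l_extension);

        -- stats on the new (virtual) column are collected by an ordinary
        -- gather - no histogram is created here
        dbms_stats.gather_table_stats(
                ownname         => user,
                tabname         => 'BIG_TABLE'
        );
end;
/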
Comment by Jonathan Lewis — December 12, 2010 @ 7:42 pm GMT Dec 12,2010 |
Thanks a ton Jonathan…
I really appreciate that you REMEMBER to post comments. Many thanks for the time you spent on the analysis and the detailed write-up…
Here are my answers…
Point 1: Totally agree with you.
Point 2: (There are some contradictions in the information you’ve supplied.)
You are absolutely correct, and I think the other day I prematurely wrote how we populate the last column. I checked with the application users and the developers and found that the application service collects row data from multiple (11+) source dbs, and ‘EXTERNAL_REFERENCE_ID’ is the PK (being populated through a sequence in the source db – NOT MY DB)… We are not using any sequence for this table. Apart from this, there were many character-based unique values pre-existing in the source EXTERNAL_REFERENCE_ID which are sized at 36 characters. (I believe this is due to some legacy code.)
That confirms that your assumption on strategy (b) is correct.
Point 3: Got that….Thanks very much for clearing that doubt.
Point 4:
(a) I am working with the developers to find out what exactly is going on.

(b) Yes, my db is running on 11.1.0.7.0. I am opening an SR to get the patch for the ITL issue… (BTW, do you remember any patch or bug# ?)

(c) Yes, space reclamation is important but I can delay the rebuild process for the next 30 days.

(d) As of now, there is no complaint w.r.t. PIO. But I think we are going to have IO issues as well going forward. This PK is basically used for an idempotency check. So, for every new record that is to be inserted, a trigger will check the unique combination of these 3 columns in the base table (through the PK)… if no record is found then the trigger will populate the IOT table and thus the index will get populated. (Just an FYI… IOT is just the name of a non-partitioned table; it’s not an Index Organized Table.)

Since the PK height is 5, PK access for the idempotency check requires 5 IOs to check uniqueness within the index b-tree structure… and if the rebuild releases a significant amount of space and hence lowers the index b-level, then IO is going to be improved by a huge factor.
(e) Partitioning is definitely a great solution you are giving. Last weekend I worked on it. Actually, I am trying to create a real 128-HASH partitioned INDEX ORGANIZED table on the EXTERNAL_REFERENCE_ID column, having a local PK on the given 3 columns. But I think the two restrictions below are going to create some noise going ahead… (though I have not tested that)… BTW, LIST-HASH partitioning is not supported for IOTs in my 11g version.

1) A partitioned IOT rebuild (move partition) cannot run in PARALLEL. The parallel option is not supported for IOTs.

2) Since I am creating HASH partitions, the application service can write a record into any of the 128 HASH partitions. So, if I am moving partitions (for re-org purposes… if required…), then the MOVE statement can lock that partition and hence the application cannot insert until the partition movement is complete.

(f) I will check the histogram option too. I will update the blog with the histogram information soon.
(g) Point noted..
(h) Again a great finding. Many thanks… I am going to check with the developers to evaluate the feasibility of accommodating this change.
Comment by Bhavik Desai — December 13, 2010 @ 8:10 am GMT Dec 13,2010 |
Height 5 is not necessarily a big deal – especially if you are not seeing any physical I/O problems relating to this index. There’s a fair chance that whether the height is 4 or 5 your I/O threat will only come from LEAF blocks, and the excess work will simply be the one extra LIO from the higher root block. Still, you may find that hash partitioning 128 ways saves you that one LIO and spreads out a little contention. You may even find that a larger number of partitions (256) is enough to drop you to height 3 – but you always have to consider the possible side effects of the number of partitions on less precise queries.
I don’t want to see the histogram figures – that was just an option for you to investigate.
When you rebuild, you should probably do it at PCTFREE 0 – if it’s only the legacy data that is non-sequential then it needs to be packed; and “4,464” index sections are only going to split on their right hand side, so they might as well start at 100% packed – you’re not “back-filling” old leaf blocks.
One reason for looking at partitioning to small chunks (ca. 2GB at 128) is that you can rebuild each partition individually if it really needs it. With an IOT you can do “alter table XXX move online”. I think you can do this to a single partition in 11g (but you’ll have to check – it wasn’t possible in earlier versions).
Comment by Jonathan Lewis — December 13, 2010 @ 7:25 pm GMT Dec 13,2010
Hi Jonathan,
Thank you for this nice way to investigate the state of an index.
I used it many times to diagnose mainly FIFO index issues.
I added two statistics in the report :
– leaf count per section
– empty leaf count per section (leaves with rrow = 0)
This was useful to show the empty blocks on the “left side” of FIFO indexes after significantly reducing the retention period.
Lately I found an index with nearly 50% empty blocks which is still claiming new extents.
The empty blocks are quite evenly distributed in the 85% leftmost sections.
My understanding is that leaf blocks with rrow = 0 are empty and are candidates to be reused.
Did I miss something ?
Thank you for your help.
Comment by Vincent — January 7, 2013 @ 5:52 pm GMT Jan 7,2013 |
Vincent,
Thanks for the comment. That sounds like a nice extension to the query, making it easier to see the scale of the problem.
You’re correct that empty blocks (rrow=0) should be re-used before new blocks are allocated, but there seem to be various timing problems (possibly only related to very large deletes) with updating the bitmap blocks under ASSM – I’ve just run a little test deleting all the rows from a few index leaf blocks, and the “emptiness” dropped to the range 25% – 50% rather than 0% – 25%, which means that at a high level Oracle doesn’t realise the leaf blocks are empty.
You might note that rrow drops to zero BEFORE the commit, so you could also imagine some problems relating to the timing of the commit, delayed block cleanout etc. even when using freelist management. You may find that Oracle manages to go through a cycle of using a few empty blocks, adding a few new ones, using a few empty, and so on.
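For anyone who wants to repeat that check, dbms_space.space_usage reports how ASSM has classified the blocks of a segment. Here’s a rough sketch against the t1_i1 index from the original article (assuming it sits in an ASSM tablespace):

declare
        l_unformatted_blocks    number;         l_unformatted_bytes     number;
        l_fs1_blocks            number;         l_fs1_bytes             number;
        l_fs2_blocks            number;         l_fs2_bytes             number;
        l_fs3_blocks            number;         l_fs3_bytes             number;
        l_fs4_blocks            number;         l_fs4_bytes             number;
        l_full_blocks           number;         l_full_bytes            number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1_I1',
                segment_type            => 'INDEX',
                unformatted_blocks      => l_unformatted_blocks,
                unformatted_bytes       => l_unformatted_bytes,
                fs1_blocks              => l_fs1_blocks,        fs1_bytes => l_fs1_bytes,
                fs2_blocks              => l_fs2_blocks,        fs2_bytes => l_fs2_bytes,
                fs3_blocks              => l_fs3_blocks,        fs3_bytes => l_fs3_bytes,
                fs4_blocks              => l_fs4_blocks,        fs4_bytes => l_fs4_bytes,
                full_blocks             => l_full_blocks,       full_bytes => l_full_bytes
        );

        dbms_output.put_line('Full blocks          : ' || l_full_blocks);
        dbms_output.put_line('0% -  25% free (FS1) : ' || l_fs1_blocks);
        dbms_output.put_line('25% - 50% free (FS2) : ' || l_fs2_blocks);
        dbms_output.put_line('50% - 75% free (FS3) : ' || l_fs3_blocks);
        dbms_output.put_line('75% -100% free (FS4) : ' || l_fs4_blocks);
end;
/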
Comment by Jonathan Lewis — January 12, 2013 @ 11:45 am GMT Jan 12,2013 |