Oracle Scratchpad

May 12, 2014

Compression Units – 6

Filed under: Exadata,HCC,Oracle — Jonathan Lewis @ 1:34 pm BST May 12,2014

I received an email recently asking me if I knew how Oracle found specific rows and columns in a compression unit. This is a topic that I’ve spoken about a couple of times, and I’ve published several notes on the blog about it, including an image of a critical slide from one of my presentations, and I was expecting to find some notes somewhere about Oracle catalogues all the bits and pieces. Part 4 of this series lists some of the detail but I was slightly surprised to discover that it made the comment: “and somewhere in the CU there has to be a set of pointers to the first byte for each compressed column” – and this was the bit that my interrogator wanted to know. This posting is the answer – and I’ll start with a dump of the first block of a CU:

tab 0, row 0, @0x30
tl: 8016 fb: --H-F--N lb: 0x0  cc: 1
nrid:  0x01c06484.0
col  0: [8004]
Compression level: 03 (Archive Low)
 Length of CU row: 8004
kdzhrh: ------PC CBLK: 11 Start Slot: 00
 NUMP: 11
 PNUM: 00 POFF: 7884 PRID: 0x01c06484.0
 PNUM: 01 POFF: 15900 PRID: 0x01c06485.0
 PNUM: 02 POFF: 23916 PRID: 0x01c06486.0
 PNUM: 03 POFF: 31932 PRID: 0x01c06487.0
 PNUM: 04 POFF: 39948 PRID: 0x01c06488.0
 PNUM: 05 POFF: 47964 PRID: 0x01c06489.0
 PNUM: 06 POFF: 55980 PRID: 0x01c0648a.0
 PNUM: 07 POFF: 63996 PRID: 0x01c0648b.0
 PNUM: 08 POFF: 72012 PRID: 0x01c0648c.0
 PNUM: 09 POFF: 80028 PRID: 0x01c0648d.0
 PNUM: 10 POFF: 88044 PRID: 0x01c0648e.0
CU header:
CU version: 0   CU magic number: 0x4b445a30
CU checksum: 0x5c173142
CU total length: 93755
CU flags: NC-U-CRD-OP
ncols: 7
nrows: 4361
algo: 0
CU decomp length: 93159   len/value length: 309310
row pieces per row: 1
num deleted rows: 0
 00 00 1f 44 1f 0b 00 00 00 0b 00 00 1e cc 01 c0 64 84 00 00 00 00 3e 1c 01
 c0 64 85 00 00 00 00 5d 6c 01 c0 64 86 00 00 00 00 7c bc 01 c0 64 87 00 00
 00 00 9c 0c 01 c0 64 88 00 00 00 00 bb 5c 01 c0 64 89 00 00 00 00 da ac 01
 c0 64 8a 00 00 00 00 f9 fc 01 c0 64 8b 00 00 00 01 19 4c 01 c0 64 8c 00 00
 00 01 38 9c 01 c0 64 8d 00 00 00 01 57 ec 01 c0 64 8e 00 00 00 4b 44 5a 30
 42 31 17 5c 00 01 6e 3b eb 06 00 07 11 09 00 04 b8 3e 01 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...  20 repetitions of the previous line of zeros deleted
...  This is mostly the bitmap identifying deleted rows.
...  nrows=4361, so this will be about 4261, so this will be
...  about 4361/8 = 545 bytes of zeros.
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00 02 54 00 00 30
 bc 00 00 4a 82 00 00 64 37 00 00 a6 b8 00 00 e9 39 00 01 2b ba e6 c2 eb 8b
 00 00 2e 68 01 40 01 02 00 00 53 f0 78 9c a5 dc 07 57 23 5b 96 25 e0 ee 24

If you look at the symbolic dump there are lots of numbers that you could try to locate in the binary dump labelled START_CU:, for example the CU magic number 4b3445a30 is very obvious at the end of line 37 above (with the bytes in the right order), and the CU checksum of 5c173142 is visible at the start of line 38 (with the bytes in reverse order – no, don’t ask why me there’s a difference).

Bearing in mind how Oracle frequently uses “self-describing” strategies for data we might search for the value that gives the length of the CU as this could well be the start of the compressed data. But there are two CU lengths – the CU total length (93755) and the CU decomp length (93159). To my mind the latter is the “amount of CU requiring decompression” but let’s check for both: 93,755 (dec) = 0x16e3b, and 93,159 (dec) = 0x16be7 .

Take another look at line 38 – the second 4 bytes are: 00 01 6e 3b, the CU total length. Unfortunately we can’t see the CU decomp length anywhere, so let’s change strategy slightly – the different between the two lengths is 93755 – 93159 =  596 = 0x254, and by a convenient coincidence if you look at line 46, five bytes from the end, you find 02 54. Is this really a coincidence ? Let’s count 596 bytes from the CU total length and see where we get to – the answer is : the last 4 bytes of line 46 (e6 c2 eb 8b) This doesn’t seem to help at first sight, but take a look at where we are and what precedes that 4 bytes. There are 7 groups of 4 bytes that look like increasing values – and since I know more about the data than you do (it’s a table with 7 columns, and the last 4 columns are identical across all rows), I’ve extracted those 28 bytes and reformatted them:

                    	Offset		Delta
                    	------		-----
        00 00 02 54 	  596
        00 00 30 bc 	12476		11880
        00 00 4a 82 	19074		 6598
        00 00 64 37 	25655		 6581
        00 00 a6 b8 	42680		17025
        00 00 e9 39 	59705		17025
        00 01 2b ba 	76730		17025
                    	93755	<----

The “Offset” column is what you get by converting from Hex to decimal, and the “Delta” column is the difference between one value and the next. Notice how the last three deltas are the same, and how, if you add the same delta to the last Offset you get a number which is the CU total length. The “02 54″ I found as the difference between the CU total length and the CU decomp length is actually the starting point for the list of pointers to each of the compressed columns, and the repeated 17,025 is the length of my duplicated final four columns. The e6 c2 eb 8b is the first four bytes of the first (compressed) column.

Having worked out that it’s possible to find the list of pointers quite easily from the dump, the next step (left as an exercise to the interested reader) is to work out the algorithm that Oracle uses to locate the list – since the CU decomp length is clearly derived and not actually stored. As a little clue, you might start by taking another look at line 38: bytes 11/12 are 00 07 – the number of columns in the table, and bytes 13/14 are 11 09 – the number of rows in the CU (and that’s the most significant number at this point because it tells you how long the bitmap for deleted rows will be).

March 5, 2014

12c pq_replicate

Filed under: 12c,Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 5:39 pm BST Mar 5,2014

Another day, another airport lounge – another quick note: one of the changes that appeared in 12c was a tweak to the “broadcast” distribution option of parallel queries. I mentioned this in a footnote to a longer article a couple of months ago; this note simply expands on that brief comment with an example. We’ll start with a simple two-table hash join – which I’ll first construct and demonstrate in

create table t1
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
	rownum				n1,
	lpad(rownum,6,'0')		small_vc,
	lpad(rownum,200,'0')		padding
	generator	v1,
	generator	v2
	rownum <= 1000

create table t2
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
	1 + mod(rownum,10000)			n1,
	lpad(1 + mod(rownum,10000),6,'0')	small_vc,
	lpad(rownum,500,'0')			padding
	generator	v1,
	generator	v2
	rownum <= 20000 ;

-- collect stats, no histograms.

  		leading(t1 t2)
 		parallel(t1 2)
 		parallel(t2 2)
from 	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc

| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT        |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  1000 |   707K|   135 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN            |          |  1000 |   707K|   135 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   207K|     4 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000 |  1000 |   207K|     4 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   207K|     4 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |          | 20000 |     9M|   131 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | T2       | 20000 |     9M|   131 |  Q1,01 | PCWP |            |

In this plan slave set 2 scans table t1 in parallel and broadcasts the result set to slave set 1 (lines 5 – 7). The significance of the broadcast option is that each slave in slave set 2 sends all the rows it has read to every slave in slave set 1. For a fairly large table with a high degree of parallelism this could be a lot of inter-process communication; the total number of rows passing through the PX message pool is “DOP x number of row filtered from t1″.

After a slave in slave set 1 has receive the whole of the t1 result set it builds an in-memory hash table and starts scanning rowid ranges (PX BLOCK ITERATOR) from table t2, probing the in-memory hash table to effect the join (lines 3,4, 8,9). Since each slave has a copy of the whole result set from t1 it can scan any chunk of t2 and handle the contents locally. Moreover, because slave set 1 isn’t reading its second input from a virtual table it is able to write its output immediately the virtual table (:TQ10001) that feeds the query coordinator with the result (lines 1,2) – we don’t have to do a “hash join buffered” operation and buffer the entire second input before starting to execute the join.

So how does 12c change things. With the same starting data and query, here’s the execution plan:

| Id  | Operation             | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
|   0 | SELECT STATEMENT      |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  1000 |   707K|   135 |  Q1,00 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN          |          |  1000 |   707K|   135 |  Q1,00 | PCWP |            |
|   4 |     TABLE ACCESS FULL | T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   5 |     PX BLOCK ITERATOR |          | 20000 |     9M|   131 |  Q1,00 | PCWC |            |
|   6 |      TABLE ACCESS FULL| T2       | 20000 |     9M|   131 |  Q1,00 | PCWP |            |

Notice, in particular, that we only have one virtual table (or table queue :TQ10000) rather than two – and that’s from a parallel query slave set to the query co-ordinator, parallel to serial; the query only uses one set of parallel query slaves. Until you run the query with rowsource execution statistics enabled and look at the output from v$pq_tqstat it’s not going to be immediately obvious what has happened, but we should see that somehow Oracle is no longer broadcasting the first table even though it’s still doing something in parallel with both tables.

The run-time statistics confirm that we’ve only used one set of slaves, and each slave in the slave set has scanned the whole of table t1. This means each slave can build the full hash table and then go on to read rowid ranges from table t2. We’ve managed to get the benefit of broadcasting t1 (every slave has the whole of t1 so we don’t have to scan and distribute the big table t2 through the PX message pool) but we haven’t had to clone it multiple times through the PX message pool.

Clearly there’s a trade-off here that Oracle Corp. has decided is worth considering. I’m guessing it’s biased towards Exadata where you might run queries with a very high degree of parallelism. In that case the overhead of task switching as large numbers of messages are passed around may (and this is pure supposition) be greater than the added cost of loading the table into the buffer cache (of each instance) and having each slave scan it from there. (Reminder – 11g introduced two “opposite” changed to tablescans: “serial direct reads” and “parallel in-memory scans”.)

There’s one little oddity in this replication – there’s a pair of hints: pq_replicate and no_pq_replicate to control the effect if you think the optimizer is making the wrong choice. I would have guessed that in my example the hint would read: /*+ pq_replicate(t1) */ as it’s table t1 that is read by every single slave. Strangely, though, this is what the outline section of the execution plan showed:

      PQ_REPLICATE(@"SEL$1" "T2"@"SEL$1")
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OPT_PARAM('_optimizer_cost_model' 'io')

Notice how the hint specifies table t2, not table t1 !


Here’s a little anomaly,  and a generic warning about “optimizer_features_enable”: I found that if I used the hint /*+ optimizer_features_enable(‘′) */ in 12c I could still get the pq_replicate() hint to work. Unfortunately there are a few places where the hint (or parameter) isn’t guaranteed to take the optimizer code backwards the full 100%.

September 13, 2013

Quiz Night

Filed under: Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:32 pm BST Sep 13,2013

Here’s a little quiz about Bloom filtering. There seem to be at least three different classes of query where Bloom filters can come into play – all involving hash joins: partition elimination, aggregate reduction on non-mergeable aggregate views, and parallelism.

This quiz is about parallel queries – and all you have to do is work out how many Bloom filters were used in the following two execution plans (produced by, and where they were used.


October 5, 2012


Filed under: Exadata,Infrastructure,Oracle,Performance,redo — Jonathan Lewis @ 1:04 pm BST Oct 5,2012

There’s never enough time to read everything that’s worth reading, so even though Guy Harrison’s blog is one of the ones worth reading I find that it’s often months since I last read it. Visiting it late last night, I found an interesting batch of articles spread over the last year about the performance of SSD – the conclusions may not be what you expect, but make sure you read all the articles or you might end up with a completely misleading impression:

Don’t forget to read the comments as well. For other notes Guy has written about SSD, here’s a URL for his SSD tag.

August 19, 2012

Compression Units – 5

Filed under: CBO,Exadata,HCC,Indexing,Oracle — Jonathan Lewis @ 6:02 pm BST Aug 19,2012

The Enkitec Extreme Exadata Expo (E4) event is over, but I still have plenty to say about the technology. The event was a great success, with plenty of interesting speakers and presentations. I was particularly keen to hear  Frits Hoogland’s comments  on Exadata and OLTP, Richard Foote on Indexes, and Maria Colgan’s comments on how Oracle is making changes to the optimizer to understand Exadata a little better.

All three presentations were interesting – but Maria’s was possiby the most important (and entertaining). In particular she told us about two patches for, one current and one that is yet to be released (unfortunately I forgot to take  note of the patch numbers – ed: but they’ve been supplied by readers’ comment below).

August 7, 2012

Compression Units – 4

Filed under: Exadata,HCC,Oracle — Jonathan Lewis @ 5:24 pm BST Aug 7,2012

Following up a suggestion from Kerry Osborne that I show how I arrived at the observation I made in an earlier posting about the size of a compression unit, here’s a short note to show you what I did. It really isn’t rocket science (that’s just a quick nod to NASA and Curiosity – the latest Mars rover).

Step 1: you can access rows by rowid in Oracle, so what happens when you try to analyze rowids on Exadata for a table using HCC ? I created a table with the option “compress for archive high” and then ran the following query:

July 27, 2012

Compression Units – 3

Filed under: Exadata,HCC,Oracle — Jonathan Lewis @ 5:08 pm BST Jul 27,2012

For those who have read the previous posting of how I engineered an Exadata disaster and want to reproduce it, here’s the script I used to generate the data.

July 23, 2012

Compression Units – 2

Filed under: CBO,Exadata,HCC,Indexing,Oracle — Jonathan Lewis @ 4:41 pm BST Jul 23,2012

When I arrived in Edinburgh for the UKOUG Scotland conference a little while ago Thomas Presslie, one of the organisers and chairman of the committee, asked me if I’d sign up on the “unconference” timetable to give a ten-minute talk on something. So I decided to use Hybrid Columnar Compression to make a general point about choosing and testing features. For those of you who missed this excellent conference, here’s a brief note of what I said.

July 20, 2012

Compression Units

Filed under: Exadata,HCC,Oracle — Jonathan Lewis @ 6:04 am BST Jul 20,2012

If you’re starting to work with Exadata you need to work out what how much impact you can have by doing the right (or wrong, or unlucky) sorts of things with the technology. Fortunately there are only a few special features that really need investigation: Hybrid Columnar Compression (HCC), Offloading (and related Smart Scan stuff) and Storage Indexes. These are the features that will have the biggest impact on the space taken up by your data, the volume of data that will move through your system as you load and query it, and the amount of CPU it takes to do something with your data.

There are other features that are important, of course, such as the features for affecting parallel execution, and the options for resource management, but the three I’ve listed above are the core of what’s (currently) unique to Exadata. In this note I’m just going to make a few comments about how Oracle implements HCC, and what side-effects this may have.

October 5, 2011

HCC – 2

Filed under: Exadata,Infrastructure,Oracle — Jonathan Lewis @ 12:07 pm BST Oct 5,2011

Just a little follow-up to my previous note on hybrid columnar compression. The following is the critical selection of code I extracted from the trace file after tracing a run of the advisor code against a table with 1,000,000 rows in it:

October 4, 2011


Filed under: Exadata,Infrastructure,Oracle — Jonathan Lewis @ 11:56 am BST Oct 4,2011

Hybrid Columnar Compression is one of the big features of Exadata that can make fairly dramatic differences to the amount of space it takes to store your data. But how do you find out what’s going on under the covers if you haven’t got an Exadata machine in your garage ?

Here’s a simple starting point that occurred to me a couple of days ago after the product manager (or some such) pointed out that there was no need to make an Exadata emulator available to anyone because all you needed was the compression advisor which you could trust because it actually compressed a sample of your data to see how well it could compress.

September 14, 2010


Filed under: Exadata,Infrastructure — Jonathan Lewis @ 6:58 pm BST Sep 14,2010

Inevitably there’s still excitement and uncertainty in the air about Exadata – and/or the Database Machine. For ease of reference and a jolly good read I’ve started collecting a few blog postings about it:

Jason Arneil’s Blog


Kerry Osborne

Kevin Closson

Tanel Poder

Uwe Hesse

Arup Nanda

Oracle Official Links

The Rubric Theme. Blog at


Get every new post delivered to your Inbox.

Join 4,895 other followers