Some time ago, I was asked to take a quick look at an application that had to handle a lot of LOBs. The LOB-specific part of the application was actually quite simple – contracts were stored as LOBs – but only for occasional visual reference; all the “structured” information from the contract was extracted and stored in relational tables. Some time after a contract had expired, the LOB could be deleted to reclaim space (in theory). Historically, the client had purged a load of LOBs from time to time, but didn’t have a deliberate house-keeping task to do the job on a regular basis.
My task was to check how much benefit they could get by improving their house-keeping routines, and to suggest any another efficiency enhancements that I could spot.
As a starting point I whipped together the following piece of SQL, just to see what was in the critical table. The purpose of the SQL was to show:
- How many LOBs were currently stored in the main table
- How many LOBs would be stored if we purged 14 days after expiry.
- How many LOBs would be stored if we purged 7 days after expiry.
- The approximate space requirements (in 8KB blocks) in all three cases.
The code is followed by the first few lines of output.
column c_size format 9,999 heading "Blocks" column c_ct format 99,999,999 column c_vol format 999,999,999 column f_ct format 99,999,999 column f_vol format 999,999,999 column w_ct format 99,999,999 column w_vol format 999,999,999 select c_size, count(*) c_ct, c_size * count(*) c_vol, f_size, f_size * count(*) f_vol, w_size, w_size * count(*) w_vol from ( select c_size, case when expiry_date > trunc(sysdate - 14) then c_size else null end f_size, case when expiry_date > trunc(sysdate - 7) then c_size else null end w_size from ( select expiry_date, ceil(dbms_lob.getlength(contract_text)/8000) c_size from contracts ) ) group by c_size, f_size, w_size order by c_size, f_size, w_size ; Blocks C_CT C_VOL F_SIZE F_VOL W_SIZE W_VOL ------ ----------- ------------ ------ ------------ ------ ------------ 1 5,123,187 5,123,187 1 5,123,187 1 5,123,187 1 91,600 91,600 1 91,600 1 4,979,383 4,979,383 2 37,414 74,828 2 74,828 2 74,828 2 12,437 24,874 2 24,874 2 643,971 1,287,942 3 15,652 46,956 3 46,956 3 46,956 3 4,011 12,033 3 12,033 3 228,683 686,049 4 5,333 21,332 4 21,332 4 21,332 4 96 384 4 384 4 8,119 32,476
The output from the query is neither elegant nor entirely self-explanatory, but one of the problems of checking production systems is that you don’t want to hit them too hard. In this case the main table was over 10GB (the LOB segment was about 0.25 TB) so I didn’t want to query it very often – especially since the query had to be a tablescan – so I was aiming for efficiency rather than pretty when I created the SQL.
I’ve used a simple scan of the table to extract the size (in units of roughly 8KB) and expiry date of each LOB. I’ve then in-lined this and wrapped it with a query that uses the current date to decide whether to include the LOB as less than a fortnight old (f_size) or less than a week old (w_size).
Once I’ve got the split, I just sum the LOB space.
Because of the way I’ve defined f_size and w_size, when I group by c_size, f_size, and w_size I end up with (up to) three rows for each possible LOB size – the order by clause then ensures that “fortnight” columns eliminate the third row of each group, and the “week” columns eliminate both the second and third.
The upshot of this report is that you can see two key features:
- A very large fraction of the LOBs use less than 8KB each
- If you only allow contracts to exist for 2 weeks after expiry, the data size drops by about 50%.
It just happened that the client had built his LOBs in a tablespace with a 16KB block size. Moving the LOB segment to a tablespace with an 8KB block size would automatically save 128GB.
A routine that ran every night to delete all LOBs older than 13 days would save another 64GB (Alternatively, running a routine every Saturday night to delete LOBs older than 7 days would have a similar effect).
Other options to consider:
Would switching to a tablespace with a 4KB block size reduce the storage further – you’d have to run the query again using 4,000 instead of 8,000 to get a clue about that (I did, and the answer was no).
Should the LOB be defined as ‘out of line storage’ only – that depends on what you do with the rest of the data in the table, but in this case the answer was yes.
Should the LOB be declared CACHE, CACHE READ, or NOCACHE; should it be declared LOGGING or NOLOGGING. Again, it depends how you use the data. In this client’s case, LOGGING was an obvious requirement, the caching decision was arguable.
Given that we were about to move the LOB back to a standard block size, caching the LOB could have a negative impact on the rest of the caching (even, for fairly subtle reasons, if you cache it in the recycle pool); on the other hand, setting it NOCACHE means all reads and writes to the LOB are “direct path” which could have an intermittent, and random, impact on the performance of other processes.
In the end we stuck with an 8KB block size, caching in the recycle pool, but it did occur to me that an alternative strategy would have been to declare the LOB in a tablespace with a 4KB block size (to give it a totally isolated cache) then set the LOB chunk size to 8KB so that Oracle could treat the LOB almost as if it were stored in a tablespace with an 8KB block size.
Update Oct 2009:
I was on a customer site a few weeks ago working on a couple of problems that they had asked me to investigate – including the apparently huge waste of space in their LOBs. As a quick check I wrote a little script (a variant of the one above) to calculate the sizes of the LOBs in a sample from each of the critical tables. The results were staggering – apparently 90% of their LOB allocation was empty space – and then I realised two things: the getlength() function counts the number of characters in a CLOB, and they were using a multibyte character set (and multibyte character sets get stored as a fixed 2-byte character set a CLOB is stored in the database). So I had to modify my script to change the calls to getlength() to “2 * getlength()”. (They were still wasting a few hundred gigabytes of LOB space – but not as much as the initial run suggested).