Oracle Scratchpad

November 10, 2020

Quiz time

Filed under: Infrastructure,Oracle,redo — Jonathan Lewis @ 9:28 am GMT Nov 10,2020

Here’s a fun little item that I’ve just rediscovered from 2016.

There’s a fairly commonly cited query that aggregates and tabulates the first_time column from the v$log_history view by hour of day, reporting the hour across the page and the date down the page. It often appears in responses to the question “how big should my online redo log files be?”, but I’m not going to supply the query here because it’s irrelevant; instead I’m going to show you a line of the output it produced (on a real production system) and then ask the important question that might prompt you to decide to resize your log files:

DAY        00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23
--------- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
06-MAY-16   4   3   4   5   3   2   0   1   3   4   1   5  12   8   2   1   7   3   0   1   5   5   2   4
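(For reference only, since the exact query really is irrelevant to the puzzle: a minimal sketch of the general shape of this type of query is shown below. The “popular” versions pivot the 24 hours across the page with one sum(decode(...)) or sum(case ...) column per hour, but the underlying aggregation is the same as this un-pivoted form; treat it as an illustrative guess, not the query that produced the line above.)

-- Illustrative sketch only: log switches per day and hour from v$log_history.
-- The pivoted version spreads the hour across 24 columns of the page,
-- but it aggregates exactly the same data.

select
        to_char(first_time,'DD-MON-RR')         day,
        to_char(first_time,'HH24')              hour_of_day,
        count(*)                                switches
from
        v$log_history
group by
        to_char(first_time,'DD-MON-RR'),
        to_char(first_time,'HH24')
order by
        min(first_time)
;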


The important question is this:

  • when was most time recorded for events relating to log file switches?

I’ll supply the answer some time tomorrow. The only comment I’ll make in advance is that there’s a huge history of variation in the way you (and Oracle) might view redo log file switches, and it would be easy for two highly competent DBAs to have dramatically different viewpoints if all you gave them was a set of numbers and no context.

Answer

There is no way you could have deduced the right answer from the line of numbers above, and it’s gratifying to see that all the comments understood the rules of this “no-win” game. The correct answer to the question was

The period in which the most significant redo waits appeared was from 20:58 to 21:02, when there were 10 log file switches in just over 3 minutes.

There are plenty of arguments that you might put forward to justify guessing some other time period, and I think they were all explained (or, rather, explained away) in the comments.

One person volunteered to highlight the “obvious” answer – the 12:00 – 13:00 interval, which recorded the single highest count. Clearly there may have been a bit of a peak in the rate of redo generation around that interval (from 11:30 to 13:30, say), but that doesn’t mean the load on the system was high enough to cause any waits.

There are a couple of places where the count shows zero, and that might be because there was no redo generated, but it might be because the log writer got stuck on (e.g.) a CI enqueue wait and everything else stopped in a wait for logfile switch.

Then, of course, if the actual workload peak had occurred just a few minutes later, another of the intervals would have been recording 12 switches, leaving the less experienced DBA with a choice of time periods to investigate.

There are two problems with this report. First there’s the fundamental issue of understanding: the rate at which redo is generated doesn’t tell you anything about how much time is lost due to contention related to redo. Secondly the report has an inherent scoping issue – as one of the comments put it, there’s a big difference between 12 logfile switches in 5 minutes and 12 logfile switches in 60 minutes. (An almost prescient comment, since the right answer was 10 logfile switches in just over 3 minutes – a burst that never showed up as 10 switches in any single hour of the report.)
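If the thing you care about is time lost, a more direct (though still rough) check is to report the time recorded against the “log file switch%” wait events rather than the switch counts. As a purely illustrative sketch (and not the method used to pin down the 20:58 to 21:02 answer), something along the following lines against v$active_session_history would do, remembering that ASH needs the Diagnostics Pack licence and that each sample approximates one second of activity:

-- Illustrative sketch only (Diagnostics Pack required): approximate seconds
-- spent waiting on the log file switch events, by minute, from the
-- in-memory ASH buffer.

select
        to_char(sample_time,'DD-MON-RR HH24:MI')        sample_minute,
        event,
        count(*)                                        approx_seconds_waited
from
        v$active_session_history
where
        event like 'log file switch%'
group by
        to_char(sample_time,'DD-MON-RR HH24:MI'),
        event
order by
        sample_minute, event
;

The same approach works against dba_hist_active_sess_history if the interesting period has already aged out of the in-memory buffer (with each sample then representing roughly ten seconds rather than one).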

As with so many of the “popular” monitoring queries that have been around for the last quarter of a century, the only real use of a report like this is to give you a little history, so that you can look at the last couple of weeks and wonder what happened around 9:00 last night that was different from usual, because that spike hasn’t been there before.

5 Comments

  1. Impossible to answer this question reliably with the provided data.
    If somebody is interested in time spent by sessions waiting on certain things, it’s best to go and look for that time spent, not the number of log switches per hour.
    Extreme examples: 12 switches 5 minutes apart each vs 4 in 30 seconds. First may be nothing, second may cause some issues (or not).

    Comment by Timur Akhmadeev — November 10, 2020 @ 2:04 pm GMT Nov 10,2020

  2. The obvious answer is “12”, so I’ll just assume that’s wrong…

    Comment by Jeff — November 10, 2020 @ 2:22 pm GMT Nov 10,2020

  3. My gut feeling is that this is not the right data to determine the “most time recorded”, as you write. The data here is just the number of events per hour. To determine the “most time recorded”, and to use this information as a prompt for resizing the log files, better data would be the sum of the difference between FIRST_TIME and some other time, per hour. The precise detail I do not know.

    Comments from a developer/analytical perspective, not a DBA perspective, but learning …..

    Comment by /* Michael D O'Shea */ (@MichaelDOShea) — November 10, 2020 @ 4:42 pm GMT Nov 10,2020

  4. Based solely on the above data, I think it is hard to answer the question “when was most time recorded for events relating to log file switches?”
    The following are the wait events associated with log file switches in 12c:
    – log file switch (archiving needed)
    – log file switch (checkpoint incomplete)
    – log file switch (clearing log file)
    – log file switch (private strand flush incomplete)
    – log file switch completion
    – log switch/archive
    – switch logfile command
    There are two important things that need to complete before a redo log file can be reused:
    – Checkpointing needs to move beyond the changes recorded in the redo log
    – Archiving the redo log has completed
    Of these two, archiving throughput is easy to estimate as it is just a sequential write. (Also assuming that all the log switches are full log switches, i.e. switches that happened when the current redo log file filled up completely.)
    Checkpoint work related to a redo log fill-up is not consistent and is hard to estimate. The redo log changes could pertain to a few or to many data blocks, and the checkpointing load depends on that factor. Also, the incremental checkpointing parameters control the throughput of these writes.
    We can see that the maximum number of log switches (assuming they are full log switches) happened between 12:00 and 13:00.
    But if the generation of these logs had been at a steady rate of, say, one log switch every 7-8 minutes, then the archiving and checkpointing might not have been a problem.
    If, on the other hand, there are 12 log groups in the thread, 12 log switches happened from 12:55 to 13:00, and the next 8 log switches happened between 13:00 and 13:05, then archiving and checkpointing might have had trouble catching up with this load, and the log file switch waits might appear within the 13:00 to 14:00 period.
    The same logic applies to other time periods as well.
    So the number of log switches per hour is not enough information to draw any solid conclusions.

    Comment by srivenu — November 10, 2020 @ 6:03 pm GMT Nov 10,2020

  5. Hours 06 and 18 have 0 log switches. Therefore I think they have the highest time recorded for a log switch, as at least it did not complete within the hour.

    Comment by Sanjeev — November 12, 2020 @ 8:47 am GMT Nov 12,2020
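Picking up the point made in comments 1 and 4 above, that what matters is the time recorded against the individual wait events rather than the number of switches, the simplest (purely illustrative) cumulative check is against v$system_event. The figures are totals since instance startup, so to localise a problem in time you would need the deltas between two snapshots (AWR or Statspack) rather than these running totals:

-- Illustrative sketch only: cumulative figures for the "log file switch"
-- family of wait events since instance startup.

select
        event,
        total_waits,
        round(time_waited_micro/1e6,1)          seconds_waited
from
        v$system_event
where
        event like 'log file switch%'
order by
        time_waited_micro desc
;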



Comments and related questions are welcome.
