Oracle Scratchpad

October 13, 2014

Memory

Filed under: Infrastructure, Oracle — Jonathan Lewis @ 5:24 pm BST Oct 13,2014

On a client site recently, experimenting with a T5-2 – fortunately a test system – we decided to restart an instance with a larger SGA. It had been 100GB, but with 1TB of memory and 256 threads (2 sockets, 16 cores per socket, 8 threads per core) it seemed reasonable to crank this up to 400GB for the work we wanted to do.

It took about 15 minutes for the instance to start; worse, it took 10 to 15 seconds for a command-line call to SQL*Plus on the server to get a response; worse still, if I ran a simple “ps -ef” to check what processes were running the output started to appear almost instantly but stopped after about 3 lines and hung for about 10 to 15 seconds before continuing. The fact that the first process name to show after the “hang” was one of the Oracle background processes was a bit of a hint, though.

Using truss on both the SQL*Plus call and on the ps call, I found that almost all the elapsed time was spent in a call to shmat (shared memory attach); a quick check with “ipcs -ma” told me (as you might guess) that the chunk of shared memory identified by truss was one of the chunks allocated to Oracle’s SGA. Using pmap on the pmon process to take a closer look at the memory I found that it consisted of a few hundred pages sized at 256MB and a lot of pages sized at 8KB; this was a little strange since the alert log had told me that the instance was allocating about 1,600 memory pages of 256MB (i.e. 400GB) and 3 pages of 8KB – clearly a case of good intentions failing.
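
For the record, the diagnosis ran along these lines (a sketch only; the exact flags and the pgrep pattern are my reconstruction, not a transcript):

    # time the system calls of a SQL*Plus startup - shmat() dominated
    truss -c sqlplus / as sysdba </dev/null

    # list shared memory segments with sizes and attach counts
    ipcs -ma

    # show pmon's memory map, including the page size of each segment
    pmap -s $(pgrep -f ora_pmon)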

It wasn’t obvious what my next steps should be – so I bounced the case off the Oak Table … and got the advice to reboot the machine. (What! – it’s not my Windows PC, it’s a T5-2.) The suggestion worked: the instance came up in a few seconds, with shared memory consisting of a few 2GB pages, a fair number of 256MB pages, and a few pages of other sizes (including 8KB, 64KB and 2MB).

There was a little more to the advice than just rebooting, of course; and there was an explanation that fitted the circumstances. The machine was using ZFS and, in the absence of a set limit, the file system cache had at one point managed to acquire 896 GB of memory. In fact when we first tried to start the instance with a 400GB SGA Oracle couldn’t start up at all until the system administrator had reduced the filesystem cache and freed up most of the memory; even then so much of the memory had originally been allocated in 8KB pages that Oracle had made a complete mess of building a 400GB memory map.

I hadn’t passed all these details to the Oak Table but the justification for the suggested course of action (which came from Tanel Poder) sounded like a perfect description of what had been happening up to that point. In total his advice was:

  • limit the ZFS ARC cache (with two possible strategies suggested)
  • use sga_target instead of memory_target, to avoid a similar problem on memory resize operations (a sketch follows this list)
  • start the instance immediately after the reboot
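
As an illustration of the second point (the 400G matches the SGA in this post, but the PGA figure is an invented placeholder, not a recommendation):

    # switch from AMM (memory_target) to ASMM (sga_target), then bounce
    sqlplus / as sysdba <<'EOF'
    -- disable automatic memory management; size the SGA explicitly
    alter system set memory_target=0 scope=spfile;
    alter system set sga_target=400G scope=spfile;
    alter system set pga_aggregate_target=64G scope=spfile;
    shutdown immediate
    startup
    EOF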

Maxim: Sometimes the solution you produce after careful analysis of the facts looks exactly like the solution you produce when you can’t think of anything else to do.

14 Comments

  1. This reminds me of this “AI Koan” from the old Jargon File: http://catb.org/~esr/jargon/html/koans.html

    -=-=-=-=-=-=-

    A novice was trying to fix a broken Lisp machine by turning the power off and on.

    Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.”

    Knight turned the machine off and on.

    The machine worked.

    Comment by Jason Bucata — October 13, 2014 @ 6:52 pm BST Oct 13,2014 | Reply

  2. We had exactly the same problem while installing and running our application on a T5-2. We ended up with the following configuration in /etc/system:

    * cap the ZFS ARC at 8GB (8589934592 bytes) so it cannot crowd out the SGA
    set zfs:zfs_arc_max=8589934592
    * adjust how much memory the ARC gives back per shrink pass (arc size >> shift)
    set zfs:zfs_arc_shrink_shift=7
    * allow up to 32 I/Os queued per vdev
    set zfs:zfs_vdev_max_pending=32
    * bitmask disabling selected large page sizes for ISM segments
    set disable_ism_large_pages=0x74

    Comment by Dave Ryan — October 13, 2014 @ 6:56 pm BST Oct 13,2014 | Reply

    • Dave,

      Thanks for the comment. I think zfs_arc_max may have been how we set the limit, though Tanel also mentioned that Solaris 11.2 has introduced a parameter user_reserve_hint_pct (sketched below).
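
      If I’ve understood it correctly the parameter can be poked into the live kernel along these lines, though both the mechanism and the 80% figure are assumptions to check against the MOS note rather than take from me:

        # reserve 80% of memory for applications, keeping the ZFS ARC out of it
        echo "user_reserve_hint_pct/W 0t80" | mdb -kw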

      Comment by Jonathan Lewis — October 14, 2014 @ 7:04 am BST Oct 14,2014 | Reply

      • Jonathan,

        I observed that setting user_reserve_hint_pct, which came with Solaris 11.2, causes permanent memory deallocations which in turn generate cross-calls. This can result in odd performance effects and increased kernel CPU time (see ZFS ARC Resizing). In my case it boosted the latencies on the HBA driver. The problem was cured by unsetting the parameter.

        Comment by Nenad Noveljic — May 9, 2016 @ 9:05 pm BST May 9,2016 | Reply

  3. Jonathan, I think the most relevant question is unanswered for us (though it is probably clear to you): was the system using ISM or DISM (or possibly OSM for 12c)? As of now it seems ISM is the only valid option for a 400GB SGA: OSM is not recommended even for 12c right now, and DISM is not even supported on SuperCluster. Yes, a T5-2 is not a SuperCluster – Exadata storage cells cannot be connected – but the point is the same: a big amount of memory.

    Moreover, the T5 is obviously the next area where Oracle Marketing is much faster than the real processor speed.

    Regards
    Pavol

    Comment by Pavol Babel — October 13, 2014 @ 10:29 pm BST Oct 13,2014 | Reply

    • Pavol,

      Thanks for the comment – the memory came up as OSM (optimised shared memory), and pgrep showed us vmtasks running in the global zone with a thread count of 257 (a sketch of the check follows). I didn’t come across anything suggesting that OSM was not recommended – do you have any relevant links I could read?
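
      The check itself was nothing more sophisticated than this (a sketch; the nlwp column is just one way of seeing the thread count):

        # vmtasks is the kernel process that appears when OSM is managing the SGA
        pgrep -l vmtasks
        # show its thread count (we saw 257)
        ps -o pid,nlwp,comm -p $(pgrep vmtasks)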

      Comment by Jonathan Lewis — October 14, 2014 @ 7:06 am BST Oct 14,2014 | Reply

      • Jonathan,

        it seems several hidden parameters are set in the onecommand template on SuperCluster T5/M6 for 12c DB. One of them is _use_osm=FALSE (even for 12.1.0.2). Unfortunately, this is based only on a hidden MOS note. Obviously, there is a reason for that.
        I would definitely go for good old ISM (which means disabling memory_target, which for me is still an unusable feature, or very close to it; the best approach is to remove sga_max_size from the spfile). There is one more important thing; to be honest I do not know whether this also applies to OSM, but for DISM you need a swap reservation on Solaris for every page of the SGA. The system makes reservations against physical swap; if that is not big enough the reservations go against memory, which decreases the amount of usable memory (this is the pseudo swap concept; see the sketch below). If you do not want to provision 1TB of swap nowadays, it is better to use ISM, which is locked (aka pinned) so no swap reservations are needed for it. DISM is not even supported on the SuperCluster platform (and the important point there is the big amount of memory).
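
        A quick way to see the reservations (a sketch; note that swap -s reports reservations, not just pages actually written to the swap device):

          # run before and after starting the instance:
          swap -s      # "used" = allocated + reserved virtual swap
          swap -l      # physical swap devices and their free blocks
          # with ISM the numbers barely move; with DISM "used" jumps by about the SGA size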

        The thing with the ZFS cache is interesting as well, and maybe slightly disappointing for me: Linux always uses free memory for the FS cache, but shrinks it immediately when another request for memory allocation kicks in. In IBM AIX this works even better. I have seen Exadata with Solaris 11 on the compute nodes, and T5 and M6 as well (M6 only in the lab), and the installation template always sets zfs_arc_max in /etc/system. So it can be considered a “best practice” for Solaris, at least to my mind.

        Comment by Pavol Babel — October 14, 2014 @ 12:26 pm BST Oct 14,2014 | Reply

        • To be honest, the need to set zfs_arc_max comes from a bug…

          Comment by Pavol Babel — October 14, 2014 @ 8:48 pm BST Oct 14,2014

        • “System performs reservations to physical swap, if not big enough, reservations go to memory, which decreases amount of usable memory (it is pseudo swap concept).” Thanks for posting. I was puzzled by the differences in behaviour: on some servers starting Oracle with DISM caused Solaris to allocate swap space equal to the SGA size, while on other servers Oracle runs with DISM and the swap stays empty. Let’s say I have a server with 128 GB of memory, the Oracle SGA size is set to 50 GB, and the swap size is 8 GB. Does it mean that if Oracle is started with DISM then 100 GB will be used: 50 GB for the SGA and another 50 GB for pseudo swap?

          Comment by Vsevolod Afanassiev — October 15, 2014 @ 9:52 am BST Oct 15,2014

        • Hi Vsevolod,

          Yes, I think your accounting for that situation is quite good. We also ran into similar problems, since the Exadata X3-2 Solaris version was delivered with 4GB of swap. The problems started after we had migrated one DB with huge PGA usage. And then (by mistake) one DB was switched to use DISM and we faced huge issues in a production system. Of course the DISM usage was fixed, and then we filed an SR (well, it was tough, since no one from Oracle seemed to understand the pseudo swap concept) and they promised to set swap to 64GB in future releases, which is quite reasonable for 256GB of memory (and ISM is still recommended for 11gR2 on engineered systems; to be more precise, it is maybe the only supported option :) ).

          The pseudo swap concept is quite similar to the HP-UX one. AIX and Linux do not work like this, although it can be turned on for AIX too. With 1TB of memory it is quite stupid to spend 256-512GB on swap that will (hopefully) never be used. On the other hand, it sometimes happens on AIX/Linux that the whole swap is consumed because of a developer bug and the system has to be restarted :) This situation is less likely on Solaris/HP-UX (but can still happen, of course :))
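
          Following the accounting above, Vsevolod’s example works out roughly like this (a back-of-envelope sketch, not exact Solaris bookkeeping):

            # 128GB RAM, 8GB disk swap, 50GB SGA started as DISM
            #   reservation required for the DISM segment:  50GB
            #     covered by the physical swap device:       8GB
            #     covered by pseudo swap (i.e. RAM):        42GB
            #   RAM tied up = 50GB (resident SGA) + 42GB (reservation) = 92GB
            # close to the "100GB used" in the question, with nothing written to disk swap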

          Comment by Pavol Babel — October 16, 2014 @ 11:35 am BST Oct 16,2014

  4. I had a look at one of our systems (Oracle 9.2 with UFS on Solaris 10 8/11): in production the SGA consists of 14,000 x 4MB pages. In PT there are 16 chunks based on 8KB pages (total = 1.3 million 8KB pages) and another 16 chunks based on 4MB pages. SGA size = 54 GB in both environments, and both use ISM (I intentionally set the sum of db_cache_size and shared_pool_size very close to sga_max_size to use ISM and avoid swap problems). Performance seems fine.
    In the test with the 400 GB SGA you had approx 30 – 50 million 8KB pages. Where is the breaking point? How many 8KB pages does one need before the slow performance becomes noticeable?

    Slightly off topic: what kind of workload requires a 400 GB SGA, and why not set it even higher, to 800 GB or 950 GB? I assume most of it gets used for the db cache.

    Comment by Vsevolod Afanassiev — October 15, 2014 @ 9:39 am BST Oct 15,2014 | Reply

  5. Vsevolod,

    do not forget In-Memory, so maybe that's why the SGA has been set to 400GB ;) And maybe the rest is reserved for the PGA ;)

    Comment by Pavol Babel — October 16, 2014 @ 10:10 pm BST Oct 16,2014 | Reply

  6. […] Oracle Scratchpad post about OSM and 8KB pages […]

    Pingback by The Not-So-Dynamic SGA | NZ DBA: Oracle Database Tuning & Tips — June 11, 2015 @ 8:00 am BST Jun 11,2015 | Reply

