Oracle Scratchpad

September 29, 2022

Case Study

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 6:27 pm BST Sep 29,2022

A recent question on the Oracle Developer Community forum asked for help with a statement that was taking a long time to run. The thread included the results from a trace file that had been passed through tkprof so we have the query and the actual execution plan with some rowsource execution stats.

Here’s the query – extracted from the tkprof output:

SELECT DISTINCT
       pll.po_line_id,
       ploc.line_location_id,
       (SELECT ptl.line_type
          FROM apps.po_line_types_tl ptl
         WHERE ptl.line_type_id = pll.line_type_id AND ptl.LANGUAGE = 'US')
           "Line_Type",
       ploc.quantity_accepted,
       NULL
           release_approved_date,
       NULL
           release_date,
       NULL
           release_hold_flag,
       NULL
           release_type,
       DECODE (ploc.po_release_id, NULL, NULL, ploc.quantity)
           released_quantity,
       (SELECT items.preprocessing_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "PreProcessing_LT",
       (SELECT items.full_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "Processing_LT",
       (SELECT items.postprocessing_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "PostProcessing_LT",
       ploc.firm_status_lookup_code,
       NVL (
           (SELECT pla.promised_date
              FROM apps.po_line_locations_archive_all pla
             WHERE     pla.po_header_id = pha.po_header_id
                   AND pla.po_line_id = pll.po_line_id
                   AND pla.line_location_id = ploc.line_location_id
                   AND pla.revision_num =
                       (SELECT MIN (revision_num)
                          FROM apps.po_line_locations_archive_all plla2
                         WHERE     plla2.promised_date IS NOT NULL
                               AND plla2.line_location_id =
                                   ploc.line_location_id)),
           ploc.promised_date)
           "Original_Promise_Date",
       (SELECT items.long_description
          FROM apps.mtl_system_items_tl items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id IN
                       (SELECT fin.inventory_organization_id
                          FROM apps.financials_system_params_all fin
                         WHERE fin.org_id = pha.org_id)
               AND items.LANGUAGE = 'US')
           "Item_Long_Description",
       NVL (ploc.approved_flag, 'N')
           approved_code,
       pvs.country
           "Supplier_Site_Country",
       pll.note_to_vendor,
         NVL (ploc.quantity, 0)
       - NVL (ploc.quantity_cancelled, 0)
       - NVL (ploc.quantity_received, 0) * ploc.price_override
           "Shipment_Amount",
       ploc.attribute4
           "PO_Ship_Date",
       (SELECT meaning
          FROM apps.fnd_lookup_values
         WHERE     lookup_type = 'SHIP_METHOD'
               AND lookup_code = ploc.attribute9
               AND language = 'US')
           "Ship_Method",
       (SELECT prla.note_to_receiver
          FROM apps.po_req_distributions_all  prda
               INNER JOIN apps.po_requisition_lines_all prla
                   ON prda.requisition_line_id = prla.requisition_line_id
         WHERE prda.distribution_id = pdi.req_distribution_id)
           "Note_To_Receiver",
       DECODE (pha.USER_HOLD_FLAG, 'Y', 'Y', pll.USER_HOLD_FLAG)
           "Hold_Flag",
       (SELECT ABC_CLASS_NAME
          FROM APPS.MTL_ABC_ASSIGNMENT_GROUPS  ASG
               INNER JOIN APPS.MTL_ABC_ASSIGNMENTS ASSI
                   ON ASG.ASSIGNMENT_GROUP_ID = ASSI.ASSIGNMENT_GROUP_ID
               INNER JOIN APPS.MTL_ABC_CLASSES classes
                   ON ASSI.ABC_CLASS_ID = classes.ABC_CLASS_ID
         WHERE     ASG.organization_id = ploc.SHIP_TO_ORGANIZATION_ID
               AND ASG.ASSIGNMENT_GROUP_NAME = 'MIN ABC Assignment'
               AND ASSI.inventory_item_id = pll.item_id)
           ABCClass,
       (SELECT CONCATENATED_SEGMENTS AS charge_accountsfrom
          FROM apps.gl_code_combinations_kfv gcc
         WHERE gcc.code_combination_id = pdi.code_combination_id)
           AS charge_accounts
  FROM apps.po_headers_all         pha,
       apps.po_lines_all           pll,
       apps.po_line_locations_all  ploc,
       apps.po_distributions_all   pdi,
       apps.per_all_people_f       papf,
       apps.AP_SUPPLIERS           pv,
       apps.AP_SUPPLIER_SITES_ALL  pvs,
       apps.AP_SUPPLIER_CONTACTS   pvc,
       apps.ap_terms               apt,
       apps.po_lookup_codes        plc1,
       apps.po_lookup_codes        plc2,
       apps.hr_locations           hlv_line_ship_to,
       apps.hr_locations           hlv_ship_to,
       apps.hr_locations           hlv_bill_to,
       apps.hr_organization_units  hou,
       apps.hr_locations_no_join   loc,
       apps.hr_locations_all_tl    hrl1,
       apps.hr_locations_all_tl    hrl2
 WHERE     1 = 1
       AND pll.po_header_id(+) = pha.po_header_id
       AND ploc.po_line_id(+) = pll.po_line_id
       AND pdi.line_location_id(+) = ploc.line_location_id
       AND ploc.shipment_type IN ('STANDARD', 'PLANNED')
       AND papf.person_id(+) = pha.agent_id
       AND TRUNC (SYSDATE) BETWEEN papf.effective_start_date
                               AND papf.effective_end_date
       AND papf.employee_number IS NOT NULL
       AND pv.vendor_id(+) = pha.vendor_id
       AND pvs.vendor_site_id(+) = pha.vendor_site_id
       AND pvc.vendor_contact_id(+) = pha.vendor_contact_id
       AND apt.term_id(+) = pha.terms_id
       AND plc1.lookup_code(+) = pha.fob_lookup_code
       AND plc1.lookup_type(+) = 'FOB'
       AND plc2.lookup_code(+) = pha.freight_terms_lookup_code
       AND plc2.lookup_type(+) = 'FREIGHT TERMS'
       AND hlv_line_ship_to.location_id(+) = ploc.ship_to_location_id
       AND hlv_ship_to.location_id(+) = pha.ship_to_location_id
       AND hlv_bill_to.location_id(+) = pha.bill_to_location_id
       AND hou.organization_id = pha.org_id
       AND hou.location_id = loc.location_id(+)
       AND hrl1.location_id(+) = pha.ship_to_location_id
       AND hrl1.LANGUAGE(+) = 'US'
       AND hrl2.location_id(+) = pha.bill_to_location_id
       AND hrl2.LANGUAGE(+) = 'US'
       AND hou.organization_id IN (2763)
       AND NVL (pha.closed_code, 'OPEN') IN ('OPEN', 'CLOSED')
       AND NVL (pll.closed_code, 'OPEN') IN ('OPEN', 'CLOSED')
       AND NVL (ploc.cancel_flag, 'N') = 'N'
       AND pha.authorization_status IN
               ('APPROVED', 'REQUIRES REAPPROVAL', 'IN PROCESS')

As you can see there are 10 inline scalar subqueries (highlighted) in the query with a select distinct to finish off the processing of an 18 table join. That’s a lot of scalar subqueries so it’s worth asking whether the code should be rewritten to use joins (though in newer vesions of Oracle some of the subqueries might be transformed to outer joins anyway – but the OP is using 11.2.0.4). We also know that a distinct is sometimes a hint that the code has a logic error that has been “fixed” by eliminating duplicates.

Ignoring those points, let’s consider the execution plan from the tkprof output which (with a tiny bit of extra formatting) is as follows:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.46       1.75          0          3          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    50346    279.02    1059.39     179103   30146895          0      755164
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50348    279.49    1061.14     179103   30146898          0      755164

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 678  
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         9          9          9  TABLE ACCESS BY INDEX ROWID PO_LINE_TYPES_TL (cr=20 pr=0 pw=0 time=680 us cost=2 size=32 card=1)
         9          9          9   INDEX UNIQUE SCAN PO_LINE_TYPES_TL_U1 (cr=11 pr=0 pw=0 time=323 us cost=1 size=0 card=1)(object id 63682480)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267756 pr=28 pw=0 time=22598079 us cost=4 size=13 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720936 pr=0 pw=0 time=4644552 us cost=3 size=0 card=1)(object id 42812859)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267747 pr=0 pw=0 time=2442479 us cost=4 size=13 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720936 pr=0 pw=0 time=1238342 us cost=3 size=0 card=1)(object id 42812859)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267743 pr=0 pw=0 time=2029190 us cost=4 size=14 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720932 pr=0 pw=0 time=967729 us cost=3 size=0 card=1)(object id 42812859)

    672743     672743     672743  TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=5507736 pr=163043 pw=0 time=535914552 us cost=3 size=27 card=1)
    672743     672743     672743   INDEX UNIQUE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=4560824 pr=163043 pw=0 time=533161038 us cost=2 size=0 card=1)(object id 42811947)
    755121     755121     755121    SORT AGGREGATE (cr=3540960 pr=163043 pw=0 time=530079821 us)
   1040963    1040963    1040963     TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=3540960 pr=163043 pw=0 time=534243973 us cost=5 size=15 card=1)
   1776649    1776649    1776649      INDEX RANGE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=1123074 pr=6392 pw=0 time=37128373 us cost=3 size=0 card=2)(object id 42811947)

    587486     587486     587486  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_TL (cr=3436629 pr=3564 pw=0 time=64125044 us cost=5 size=34 card=1)
    587486     587486     587486   INDEX RANGE SCAN MTL_SYSTEM_ITEMS_TL_U1 (cr=2852930 pr=869 pw=0 time=45628505 us cost=4 size=0 card=1)(object id 136492495)
         1          1          1    TABLE ACCESS BY INDEX ROWID FINANCIALS_SYSTEM_PARAMS_ALL (cr=645351 pr=0 pw=0 time=5743158 us cost=2 size=10 card=1)
    322268     322268     322268     INDEX SKIP SCAN FINANCIALS_SYSTEM_PARAMS_U1 (cr=323083 pr=0 pw=0 time=5104895 us cost=1 size=0 card=1)(object id 42770563)

        10         10         10  TABLE ACCESS BY INDEX ROWID FND_LOOKUP_VALUES (cr=51 pr=1 pw=0 time=3620 us cost=5 size=60 card=1)
        20         20         20   INDEX RANGE SCAN FND_LOOKUP_VALUES_X99 (cr=31 pr=1 pw=0 time=2133 us cost=4 size=0 card=1)(object id 42759866)

    634276     634276     634276  NESTED LOOPS  (cr=3540930 pr=5535 pw=0 time=181518759 us cost=5 size=28 card=1)
    634276     634276     634276   TABLE ACCESS BY INDEX ROWID PO_REQ_DISTRIBUTIONS_ALL (cr=1631471 pr=5253 pw=0 time=65405333 us cost=3 size=12 card=1)
    634276     634276     634276    INDEX UNIQUE SCAN PO_REQ_DISTRIBUTIONS_U1 (cr=994522 pr=5252 pw=0 time=31023194 us cost=2 size=0 card=1)(object id 42788583)
    634276     634276     634276   TABLE ACCESS BY INDEX ROWID PO_REQUISITION_LINES_ALL (cr=1909459 pr=282 pw=0 time=115275921 us cost=2 size=16 card=1)
    634276     634276     634276    INDEX UNIQUE SCAN PO_REQUISITION_LINES_U1 (cr=944449 pr=268 pw=0 time=12285440 us cost=1 size=0 card=1)(object id 42789681)

    511989     511989     511989  NESTED LOOPS  (cr=3533763 pr=6 pw=0 time=8999321 us cost=5 size=55 card=1)
    511989     511989     511989   NESTED LOOPS  (cr=2850293 pr=6 pw=0 time=7086027 us cost=4 size=45 card=1)
    576055     576055     576055    TABLE ACCESS BY INDEX ROWID MTL_ABC_ASSIGNMENT_GROUPS (cr=612378 pr=0 pw=0 time=2002832 us cost=2 size=29 card=1)
    576055     576055     576055     INDEX UNIQUE SCAN MTL_ABC_ASSIGNMENT_GROUPS_U2 (cr=36323 pr=0 pw=0 time=951307 us cost=1 size=0 card=1)(object id 42783622)
    511989     511989     511989    TABLE ACCESS BY INDEX ROWID MTL_ABC_ASSIGNMENTS (cr=2237915 pr=6 pw=0 time=4672006 us cost=3 size=16 card=1)
    511989     511989     511989     INDEX UNIQUE SCAN MTL_ABC_ASSIGNMENTS_U1 (cr=1551490 pr=4 pw=0 time=2533524 us cost=2 size=0 card=1)(object id 42757737)
    511989     511989     511989   TABLE ACCESS BY INDEX ROWID MTL_ABC_CLASSES (cr=683470 pr=0 pw=0 time=1488045 us cost=1 size=10 card=1)
    511989     511989     511989    INDEX UNIQUE SCAN MTL_ABC_CLASSES_U1 (cr=171481 pr=0 pw=0 time=693745 us cost=0 size=0 card=1)(object id 42789694)

     13320      13320      13320  TABLE ACCESS BY INDEX ROWID GL_CODE_COMBINATIONS (cr=34801 pr=0 pw=0 time=802675 us cost=3 size=49 card=1)
     13320      13320      13320   INDEX UNIQUE SCAN GL_CODE_COMBINATIONS_U1 (cr=21481 pr=0 pw=0 time=397344 us cost=2 size=0 card=1)(object id 42775044)


    755164     755164     755164  HASH UNIQUE (cr=30147018 pr=179103 pw=0 time=1058922684 us cost=749257 size=197349453 card=482517)
    768890     768890     768890   HASH JOIN  (cr=7289842 pr=6926 pw=0 time=244582512 us cost=696202 size=197349453 card=482517)
    140451     140451     140451    TABLE ACCESS FULL PER_ALL_PEOPLE_F (cr=38207 pr=0 pw=0 time=313692 us cost=18484 size=13278261 card=428331)
    768890     768890     768890    NESTED LOOPS OUTER (cr=7251635 pr=6926 pw=0 time=242897348 us cost=672652 size=30016980 card=79410)
    755121     755121     755121     NESTED LOOPS OUTER (cr=5538283 pr=6031 pw=0 time=154841427 us cost=443987 size=28382903 card=78623)
    755121     755121     755121      NESTED LOOPS OUTER (cr=5508916 pr=6031 pw=0 time=153523676 us cost=443982 size=18184959 card=51809)
    755121     755121     755121       NESTED LOOPS OUTER (cr=5386279 pr=6031 pw=0 time=151985656 us cost=443978 size=11642422 card=34142)
    755121     755121     755121        NESTED LOOPS  (cr=5090949 pr=6031 pw=0 time=139220421 us cost=375644 size=11574138 card=34142)
    792959     792959     792959         NESTED LOOPS  (cr=1747964 pr=134 pw=0 time=64597738 us cost=109035 size=19934760 card=73560)
    254919     254919     254919          HASH JOIN OUTER (cr=315780 pr=6 pw=0 time=14811187 us cost=29121 size=5413350 card=22650)
    254919     254919     254919           NESTED LOOPS OUTER (cr=286919 pr=0 pw=0 time=12395919 us cost=13792 size=5209500 card=22650)
    254919     254919     254919            HASH JOIN RIGHT OUTER (cr=107134 pr=0 pw=0 time=12153146 us cost=13790 size=3868572 card=17426)
      3834       3834       3834             VIEW  HR_LOCATIONS (cr=3913 pr=0 pw=0 time=15826 us cost=125 size=360 card=60)
      3834       3834       3834              NESTED LOOPS  (cr=3913 pr=0 pw=0 time=15055 us cost=125 size=1080 card=60)
      3834       3834       3834               TABLE ACCESS FULL HR_LOCATIONS_ALL (cr=262 pr=0 pw=0 time=11211 us cost=125 size=304 card=38)
      3834       3834       3834               INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=3651 pr=0 pw=0 time=6183 us cost=0 size=20 card=2)(object id 42783719)
    254919     254919     254919             HASH JOIN RIGHT OUTER (cr=103221 pr=0 pw=0 time=11917174 us cost=13666 size=3764016 card=17426)
      3834       3834       3834              VIEW  HR_LOCATIONS (cr=3898 pr=0 pw=0 time=14651 us cost=125 size=360 card=60)
      3834       3834       3834               NESTED LOOPS  (cr=3898 pr=0 pw=0 time=14267 us cost=125 size=1080 card=60)
      3834       3834       3834                TABLE ACCESS FULL HR_LOCATIONS_ALL (cr=247 pr=0 pw=0 time=9532 us cost=125 size=304 card=38)
      3834       3834       3834                INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=3651 pr=0 pw=0 time=9539 us cost=0 size=20 card=2)(object id 42783719)
    254919     254919     254919              HASH JOIN RIGHT OUTER (cr=99323 pr=0 pw=0 time=11817243 us cost=13541 size=3659460 card=17426)
        45         45         45               INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=21 pr=0 pw=0 time=614 us cost=4 size=49 card=1)(object id 63685210)
    254919     254919     254919               HASH JOIN RIGHT OUTER (cr=99302 pr=0 pw=0 time=11729251 us cost=13537 size=2805586 card=17426)
        59         59         59                INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=20 pr=0 pw=0 time=445 us cost=4 size=49 card=1)(object id 63685210)
    254919     254919     254919                NESTED LOOPS  (cr=99282 pr=0 pw=0 time=11653162 us cost=13533 size=1951712 card=17426)
         1          1          1                 NESTED LOOPS OUTER (cr=116 pr=0 pw=0 time=113273 us cost=3 size=40 card=1)
         1          1          1                  NESTED LOOPS  (cr=113 pr=0 pw=0 time=113227 us cost=2 size=32 card=1)
         1          1          1                   INDEX UNIQUE SCAN HR_ALL_ORGANIZATION_UNTS_TL_PK (cr=110 pr=0 pw=0 time=113164 us cost=1 size=17 card=1)(object id 63680720)
         1          1          1                   TABLE ACCESS BY INDEX ROWID HR_ALL_ORGANIZATION_UNITS (cr=3 pr=0 pw=0 time=59 us cost=1 size=15 card=1)
         1          1          1                    INDEX UNIQUE SCAN HR_ORGANIZATION_UNITS_PK (cr=2 pr=0 pw=0 time=7 us cost=0 size=0 card=1)(object id 42789144)
         1          1          1                  TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=3 pr=0 pw=0 time=42 us cost=1 size=8 card=1)
         1          1          1                   INDEX UNIQUE SCAN HR_LOCATIONS_PK (cr=2 pr=0 pw=0 time=7 us cost=0 size=0 card=1)(object id 42797079)
    254919     254919     254919                 TABLE ACCESS BY INDEX ROWID PO_HEADERS_ALL (cr=99166 pr=0 pw=0 time=11505632 us cost=13530 size=1254672 card=17426)
    255397     255397     255397                  INDEX SKIP SCAN PO_HEADERS_ALL_X3 (cr=1753 pr=0 pw=0 time=725236 us cost=352 size=0 card=37674)(object id 42773719)
    254883     254883     254883            INDEX UNIQUE SCAN AP_TERMS_TL_U1 (cr=179785 pr=0 pw=0 time=183291 us cost=0 size=8 card=1)(object id 42798416)
    482528     482528     482528           TABLE ACCESS FULL AP_SUPPLIER_SITES_ALL (cr=28861 pr=6 pw=0 time=227983 us cost=13727 size=4323123 card=480347)
    792959     792959     792959          TABLE ACCESS BY INDEX ROWID PO_LINES_ALL (cr=1432184 pr=128 pw=0 time=53002963 us cost=5 size=96 card=3)
    793375     793375     793375           INDEX RANGE SCAN PO_LINES_U2 (cr=504726 pr=20 pw=0 time=17603112 us cost=2 size=0 card=5)(object id 42755253)
    755121     755121     755121         TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ALL (cr=3342985 pr=5897 pw=0 time=71357938 us cost=4 size=68 card=1)
   1138558    1138558    1138558          INDEX RANGE SCAN PO_LINE_LOCATIONS_N15 (cr=1707311 pr=5830 pw=0 time=37903421 us cost=3 size=0 card=2)(object id 63697005)
    723002     723002     723002        VIEW PUSHED PREDICATE  HR_LOCATIONS (cr=295330 pr=0 pw=0 time=11391536 us cost=2 size=2 card=1)
    723002     723002     723002         NESTED LOOPS  (cr=295330 pr=0 pw=0 time=11004720 us cost=2 size=18 card=1)
    723002     723002     723002          INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=146911 pr=0 pw=0 time=1391389 us cost=1 size=10 card=1)(object id 42783719)
    723002     723002     723002          TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=148419 pr=0 pw=0 time=9233363 us cost=1 size=8 card=1)
    723002     723002     723002           INDEX UNIQUE SCAN HR_LOCATIONS_PK (cr=117800 pr=0 pw=0 time=836734 us cost=0 size=0 card=1)(object id 42797079)
    755119     755119     755119       INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=122637 pr=0 pw=0 time=829404 us cost=0 size=20 card=2)(object id 42783719)
    755121     755121     755121      INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=29367 pr=0 pw=0 time=716408 us cost=0 size=20 card=2)(object id 42783719)
    768883     768883     768883     TABLE ACCESS BY INDEX ROWID PO_DISTRIBUTIONS_ALL (cr=1713352 pr=895 pw=0 time=75314769 us cost=3 size=17 card=1)
    768883     768883     768883      INDEX RANGE SCAN PO_DISTRIBUTIONS_N1 (cr=1096671 pr=874 pw=0 time=24392643 us cost=2 size=0 card=1)(object id 42782429)

The plan is a bit long, but you may recall that a query with scalar subqueries in the select list reports the plans for each of the separate scalar subqueries before reporting the main query block – and I’ve inserted blank lines in the output above to improve the visibility of the individual blocks / scalar subqueries.

An odd little detail of this tkprof output was that there was no report of the wait information recorded against the query, though the following information appeared as the summary for the trace file, giving us a very good idea of the wait events for the individual query:

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        6      0.85       2.14          0          6          0           0
Execute      6      0.00       0.00          0          7        104          85
Fetch    50358    279.03    1059.39     179103   30146895          0      755329
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50370    279.88    1061.54     179103   30146908        104      755414

Misses in library cache during parse: 3

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   50363        0.00          0.00
  SQL*Net message from client                 50362      157.17        227.70
  row cache lock                                141        0.03          0.67
  library cache lock                             77        0.01          0.21
  library cache pin                              75        0.01          0.27
  Disk file operations I/O                      791        0.00          0.01
  gc current block 3-way                     835881        0.15        305.35
  gc current block 2-way                     471360        0.24        144.04
  KJC: Wait for msg sends to complete            40        0.00          0.00
  gc cr multi block request                       8        0.00          0.00
  gc current block congested                  10014        0.03          4.23
  gc cr block 3-way                           20215        0.06          4.69
  gc current grant busy                          20        0.00          0.00
  gc cr grant 2-way                          165010        0.07         25.13
  db file sequential read                    179103        0.05        196.31
  gc cr grant congested                         729        0.19          0.36
  gc current block busy                       71431        0.05        118.15
  gc cr block 2-way                            1800        0.01          0.31
  latch free                                      3        0.00          0.00
  gc cr block congested                         197        0.01          0.06
  latch: cache buffers chains                    45        0.00          0.00
  latch: gc element                              15        0.00          0.00
  gc cr block busy                               15        0.02          0.07
  latch: object queue header operation            1        0.00          0.00
  KSV master wait                                 2        0.00          0.00
  ASM file metadata operation                     1        0.00          0.00
  SQL*Net more data to client                     1        0.00          0.00
  gc current grant 2-way                          6        0.00          0.00

An important initial observation is that the query returned 750,000 rows in 50,000 fetches (all figures rounded for convenience) and that’s consistent with the SQL*Plus default arraysize of 15. So there might be a little time saved by setting the arraysize to a larger value (but only a few 10s of seconds – based on the 227 seconds total minus the 157 second maximum wait for the “SQL*Net message from client” figures and there may be some benefit of increasing the SQL*Net SDU_SIZE at the same time). Critically, though, we should ask “why do you want a query to return 750,000 rows?”, and “how fast do you think is ‘reasonable’?” You’ll also note from the “gc” waits that the system is based on RAC with at least 3 nodes – and RAC is always a suspect when you see unexpected time spent in a query.

Where in the driving query block does most of the time go between the last hash join (line 62) and the hash unique (line 61) – it’s in the query block whose plan starts at line 28 where we see 163,000 physical blocks read (pr=) and 535 seconds (time= microseconds) of which 6,400 blocks come from the index range scan operation at line 32 but most comes from line 31 fetching 1 million rows (by index rowid) from table po_lines_locations_archive_all.

    672743     672743     672743  TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=5507736 pr=163043 pw=0 time=535914552 us cost=3 size=27 card=1)
    672743     672743     672743   INDEX UNIQUE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=4560824 pr=163043 pw=0 time=533161038 us cost=2 size=0 card=1)(object id 42811947)
    755121     755121     755121    SORT AGGREGATE (cr=3540960 pr=163043 pw=0 time=530079821 us)
   1040963    1040963    1040963     TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=3540960 pr=163043 pw=0 time=534243973 us cost=5 size=15 card=1)
   1776649    1776649    1776649      INDEX RANGE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=1123074 pr=6392 pw=0 time=37128373 us cost=3 size=0 card=2)(object id 42811947)

This part of the workload comes from 672,743 executions of the subquery starting at line 36 of the original query text:

           (SELECT pla.promised_date
              FROM apps.po_line_locations_archive_all pla
             WHERE     pla.po_header_id = pha.po_header_id
                   AND pla.po_line_id = pll.po_line_id
                   AND pla.line_location_id = ploc.line_location_id
                   AND pla.revision_num =
                       (SELECT MIN (revision_num)
                          FROM apps.po_line_locations_archive_all plla2
                         WHERE     plla2.promised_date IS NOT NULL
                               AND plla2.line_location_id =
                                   ploc.line_location_id))

If we want to improve the performance of this query with a minimum of re-engineering, recoding and risk then a good point to start would be to examine this query block in isolation and see if there is a simple, low-cost way of improving its efficiency. (Note: this may not be a route to optimising the whole query “properly”, but it may give a quick win that is “good enough”.)

We could go a little further down this route of optimising the scalar subqueries by looking at the time spent in each of them in turn. Taking out the top line of each of the separate sections of the plan and extracting just the pr, pw and time values (which I’ll scale back from microseconds to seconds) we get the following

pr=      0      pw=0    time=   0
pr=     28      pw=0    time=  23
pr=      0      pw=0    time=   2
pr=      0      pw=0    time=   2
pr= 163043      pw=0    time= 536
pr=   3564      pw=0    time=  64
pr=      1      pw=0    time=   0
pr=   5535      pw=0    time= 182
pr=      6      pw=0    time=   9
pr=      0      pw=0    time=   1

The 8th scalar subquery (line 42 in the plan, line 75 in the query) gives us an opportunity to reduce the run time by 182 seconds, so might be worth a little investment in programmer time.

The 6th subquery (line 34 in the plan, line 49 in the query) adds only 64 seconds to the run time, so we might be less inclined to do anything about it.

You might note that the 2nd, 3rd and 4th subqueries are against the same table with the same predicate to get three different columns – this group is the “obvious” choice for recoding as a single join rather than three separate subqueries, but if you look at the total times of the three subqueries the “extra” two executions add only two seconds each to the total time – so although the this scalar subquery coding pattern is undesirable, it’s not necessarily going to be worth expending the effort to rewrite it in this case.

If you’re wondering, by the way, why different subqueries are reporting different numbers of rows returned (and each one should return at most one row on each execution), there are two reasons for any subquery to be reporting fewer than the 768,890 rows reported by the basic driving hash join:

  • first – an execution may simply return no rows,
  • secondly – there may be some benefits from scalar subquery caching.

One of the nice details about newer versions of Oracle is that the “starts” statistic is also reported in the trace/tkprof output so you would be able to see how much your query had benefited from scalar subquery caching.

If we add together the time reported by each of the scalar subquery sections of the plan the total time reported is approximately 819 seconds. Cross-checking with the difference in the times reported for operations 61 and 62 (hash unique of hash join) we see: 1,059 seconds – 245 seconds = 814 seconds. This is a good match (allowing for the accumulation of a large number of small errors) for the 819 seconds reported in the subqueries – so the hash unique isn’t a significant part of the query even though it has virtually no effect on the volume of data. You’ll note that it didn’t spill to disc (pw = 0) but completed in memory.

Summary

I’ve written a quick note on this query because the coding style was undesirable and the execution plan quite lengthy. I’ve reviewed how the style of the SQL is echoed in the shape of the plan. I’ve then pursued the idea of optimising the code “piece-wise” to see if there were any opportunities for improving the performance “enough” without going through the effort of a complete redesign of the query. [Update: One of the participants in the thread is currently walking through the mechanics of manually transforming the most expensive scalar subquery into an outer join by unnesting.]

Given the information in the Rowsource Execution from the trkprof output it proved easy to identify where the largest amounts of times appeared that might be reduced by very localised optimsation.

In passing I pointed out the option for reducing the time spent on network traffic by increasing the array fetch size, and increasing the SDU_SIZE for the SQL*Net messages to client.

Footnote (addendum)

I made a passing reference to the waits that told us that the user was running RAC. These waits merit a few follow-up comments.

The numbers for “gc” waits are high. Of particular note are the 71,000 waits and 118 seconds waited on “gc current block busy” which wave a big red flag telling us that there’s too much DML modifying the same object(s) from multiple nodes at the same time. (The even larger numbers for the “gc current block 2/3- way” say the same, but “busy” really emphasises the “hot-spot” aspect of the problem.)

Ideally we would like to see exactly where in the execution plan the bulk of those waits is occurring and, since the OP has been able to supply a trace file for the query, it’s possible that the query can be re-run to produce the SQL Monitor report (if the OP is suitably licenced) that summarises the active session history (ASH) for each line of the plan.

If the ASH data were available for a run of the report we could then do some analysis of parameter values recorded in v$active_session_history to see if that supplied further information. Unfortunately the view v$event_name doesn’t tell us what the parameter values mean for most of the “gc current%” waits, but a couple of the ones that do have descriptions report parameter1 as the file# and parameter2 as the block#, so maybe that’s true for many of them. (For some of the waits parameter1 is listed as the “le” (lock element), which doesn’t really help very much.)

Another diagnostic that could be very helpful is to take a snapshot of the session activity stats (v$sesstat)for the session as this might tell us that part of the load comes from “unexpected” work going on. In particular if we do an analysis of the “db file sequential read” waits we may find that many of the waits are for blocks in the undo tablespace, which would prompt us to examine the session stats to see what they tell us through the “% – undo records applied” statistics.

As a final comment – linking the “gc” activity back to my comment “That’s a lot of scalar subqueries so it’s worth asking whether the code should be rewritten to use joins” – if you add an extra table to a query with a simple join condition to add columns from that table to the query select list then Oracle can pin various index blocks; if you replace the join with a scalar subquery (which means you’re after just one column from one row each time) then Oracle has to pin and release all the index blocks on each call to the subquery. The benefit of the subquery approach is that scalar subquery caching may mean the subquery is rarely executed (check the highly suggestive stats in the plan for the first and seventh scalar subquery blocks – lines 16 and 39); the downside to the subquery approach is that you may end up spending a lot more time in buffer cache handling which, for RAC, includes the global cache (gc) management.

July 22, 2022

Trim CPU

Filed under: Execution plans,Hash Join,Joins,Oracle,Performance,Problem Solving — Jonathan Lewis @ 6:56 am BST Jul 22,2022

Prompted by an unexpectedly high CPU usage on a hash join of two large dadta sets Stefan Koehler posted a poll on twitter recently asking for opinions on the following code fragment:

FROM
    TAB1
INNER JOIN TAB2 ON
    TAB1.COL1 = TAB2.COL1
AND TRIM(TAB1.COL3) > TRIM(TAB2.COL3)

While I struggle to imagine a realistic business requirement for the second predicate and think it’s indicative of a bad data model, I think it is nevertheless quite instructive to use the example to show how a hash join can use a lot of CPU if the join includes a predicate that isn’t on equality.

Trivia

Before examining the potential for wasting CPU, I’ll just point out two problems with using the trim() function in this way – because (while I hope that col3 is character string in both tables) I’ve seen code that uses “to_date(to_char(date_column))” instead of trunc(date_column):

Cut-n-paste from SQL*Plus:

SQL> select 1 from dual where trim(100) > trim(20);

no rows selected

==================================================================

SQL> alter session set nls_date_format = 'dd-mon-yyyy hh24:mi:ss';

SQL> select d1, d2 from t2 where trim(d1) > trim(d2);

20-jul-2022 15:24:46 19-aug-2022 15:26:44

1 row selected.

SQL> alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';

SQL> select d1, d2 from t2 where trim(d1) > trim(d2);

no rows selected

The trim() function converts numerics and dates to strings using the default format for the session before the comparison takes place, so not only can you get unexpected (i.e. wrong) results, two users can get contradictory results from the same data at the same time because they’ve specified different session defaults!

The CPU issue

The critical point that everyone should remember is this: hash joins can only operate on equality (though, to avoid ambiguity, one should point out that “equality” does also mean “not equals”, which is why hash anti-joins can be efficient).

This means that even though the clause “where tab1.col1 = tab2.col1 and tab1.col3 > tab2.col3” might specify the matching rows for an individual tab1 row with high precision and great efficiency for a nested loop join with the right index, a hash join has a completely different workload. Every tab1 row has to have its col3 compared with every tab2 row that matches on col1. The secondary tests multiply up to “n-squared”, and if any col1 value is is highly repetitive then the work done on checking col3 becomes excessive.

It’s easier to see this in a worked example, so here’s some sample data:

rem
rem     Script:         trim_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem

create table tab1 as select * from all_Objects where owner != 'PUBLIC' and object_type != 'SYNONYM' and rownum <= 200;

create table tab2 as select * from all_Objects where owner != 'PUBLIC' and object_type != 'SYNONYM';

On a new pdb in 19.11 and 21.3 the second statement gave me roughly 46,000 rows. checking owners and row counts I got the following results:

SQL> select owner, count(*) from tab1 group by owner;

OWNER                      COUNT(*)
------------------------ ----------
SYS                             128
SYSTEM                           65
OUTLN                             7

SQL> select owner, count(*) from tab2 group by owner;

OWNER                      COUNT(*)
------------------------ ----------
SYS                           40104
SYSTEM                          417
OUTLN                             7

... plus about 17 rows aggregating 6,000 rows

And here’s the query (indicating 4 variations) that I’m going to use to demonstrate the CPU issue, followed by its execution plan and rowsource_execution_statistics:

set serveroutput off
alter session set statistics_level = all;

select
        count(*)
from
        tab1
inner join 
        tab2 
on
        tab1.owner = tab2.owner
-- and  trim(tab1.object_name)  > trim(tab2.object_name)
-- and  rtrim(tab1.object_name) > rtrim(tab2.object_name)
-- and  ltrim(tab1.object_name) > ltrim(tab2.object_name)
and     tab1.object_name > tab2.object_name
;

select * from table(dbms_xplan.display_cursor(format=>'projection allstats last'));

SQL_ID  74m49y5av3mpg, child number 0
-------------------------------------
select  count(*) from  tab1 inner join  tab2 on  tab1.owner =
tab2.owner -- and trim(tab1.object_name)  > trim(tab2.object_name) -- and rtrim(tab1.object_name) > rtrim(tab2.object_name) 
-- and ltrim(tab1.object_name) > ltrim(tab2.object_name) and tab1.object_name > tab2.object_name

Plan hash value: 2043035240

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.39 |     942 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.39 |     942 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    101K|    329K|00:00:00.39 |     942 |  1335K|  1335K|  814K (0)|
|   3 |    TABLE ACCESS FULL| TAB1 |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| TAB2 |      1 |  46014 |  46014 |00:00:00.01 |     937 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("TAB1"."OWNER"="TAB2"."OWNER")
       filter("TAB1"."OBJECT_NAME">"TAB2"."OBJECT_NAME")

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]
   2 - (#keys=1; rowset=407)
   3 - (rowset=256) "TAB1"."OWNER"[VARCHAR2,128], "TAB1"."OBJECT_NAME"[VARCHAR2,128]
   4 - (rowset=256) "TAB2"."OWNER"[VARCHAR2,128], "TAB2"."OBJECT_NAME"[VARCHAR2,128]

Comparing the basic colums the CPU time recorded at the Hash Join operation was 0.39 seconds, of which only a tiny amount was in the feeding tablescans. There are two things to note from the plan.

First is confirmation of my comments about the join having to be an equality and the inequality being applied later. You can see this in the Predicate Information in the way the user’s predicate list has been split at operation 2 into an access() predicate and a filter() predicate. The access predicate finds the right hash bucket and row(s) within bucket – the filter predicate is applied as a secondary test.

The second point to note is that the Column Projection Information shows us that the basic column values are passed up to the Hash Join, which tells us that the hash join operation has to do the trimming. The big question at that point is – how many times does the same value from the same incoming row get trimmed.

Remember that there are 128 rows in tab1 where where owner = ‘SYS’, so when a ‘SYS’ row arrives from tab2 the hash join has to find the right bucket then walk through the rows in that bucket (which will probably be nothing but those SYS rows). So how many times does Oracle evaluate trim(SYS). Arguably it needs to for each tab1 row in the bucket (though the hash table might have been built to include the trimmed value) but clearly it ought not to re-evaluate it 128 times for the column in the single tab2 row – and we’ll come back to that point later.

Let’s go back to the 3 variants on the first test. We were interested in the comparing trim() with trim(), but since trim() is equilavent to ltrim(rtrim()) I wondered if ltrim (left trim) and rtrim (right trim) took different amount of time, and whether the trim() time would be close to the sum of ltrim() time and rtrim() time.

Without showing the plans etc. here are the time reported in my 19.11.0.0 test at the hash join operation (the 21.3 times were very similar):

  • no trim – 0.39 seconds
  • ltrim() – 1.02 seconds
  • rtrim() – 2.66 seconds
  • trim() – 2.70 seconds.

Clearly that’s a lot of extra CPU on top of the base CPU cost. This is not entirely surprising since string operations tend to be expensive, neverthless the differences are large enough to be more than random fluctuations and operational error.

Remember that this is just two tables of 200 and 46,000 rows respectively. It turned out that the rowsources that Stefan was using were in the order of 800K and 2M rows – with CPU time increasing from 1,100 seconds to 2,970 seconds because of the trim().

So how many times was the trim() function called in total?

Faking it.

If we assume that the trim() built-in SQL function behaves in the same way as a deterministic PL/SQL function we can at least count the number of calls that take place by writing a deterministic function to put into the SQL. Something like:

create or replace package p1 as
        n1 number;
        function f1(v1 in varchar2) return varchar2 deterministic;
end;
/

create or replace package body p1 as 

        function f1 (v1 in varchar2)
        return varchar2 
        deterministic
        is
        begin
                p1.n1 := p1.n1 + 1;
                return trim(v1);
        end;

end;
/

set serveroutput off
alter session set statistics_level = all;

exec p1.n1 := 0

select
        count(*)
from
    tab1
inner join tab2 on
    tab1.owner = tab2.owner
and     p1.f1(tab1.object_name) > p1.f1(tab2.object_name)
-- and  p1.f1(tab1.object_name) > trim(tab2.object_name)
-- and  trim(tab1.object_name)  > p1.f1(tab2.object_name)
;

select * from table(dbms_xplan.display_cursor(format=>'projection allstats last'));

set serveroutput on
execute dbms_output.put_line(p1.n1);

I’ve created a package with a public variable n1 so that I can set it and read it from “outside”, then I’ve created (and lied about) a function that increments that variable and returns its input, claiming that it’s deterministic. Once I’ve got the package in place I’ve:

  • set the variable to zero
  • run a query that does one of
    • use my function twice
    • use my function once – on the build table
    • use my function once – on the probe table
  • report the execution plan with stats
  • print the value of the variable

The timings are not really important, but here’s the execution plan when I used the function on both sides of the inequality:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:21.15 |    1513 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:21.15 |    1513 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |  23007 |    329K|00:00:21.13 |    1513 |  1335K|  1335K|  860K (0)|
|   3 |    TABLE ACCESS FULL| TAB1 |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| TAB2 |      1 |  46014 |  46014 |00:00:00.02 |     937 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("TAB1"."OWNER"="TAB2"."OWNER")
       filter("P1"."F1"("TAB1"."OBJECT_NAME")>"P1"."F1"("TAB2"."OBJECT_NAME"))

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=0) COUNT(*)[22]
   2 - (#keys=1)
   3 - (rowset=256) "TAB1"."OWNER"[VARCHAR2,128], "TAB1"."OBJECT_NAME"[VARCHAR2,128]
   4 - "TAB2"."OWNER"[VARCHAR2,128], "TAB2"."OBJECT_NAME"[VARCHAR2,128]

Apart from the change of function name the plan is the same – although it now takes over 21 CPU seconds to complete, of which most of the time is probably building and tearing down the PL/SQL stack. The important figure, though is the number of function calls I saw recorded in p1.n1: it was a little over 10 million calls to generate the 329 thousand rows (A-Rows for the hash join).

When I ran the code with only one call to my deterministic function it was called 5 million times regardless of whether it was used for the build or probe table. Oracle did nothing to minimise the number of times the function was called.

Predictive Mode

Near the start of this note I showed you a little query to aggregate the rows of the two tables by owner; with a little enhancement I can reuse that code to show you exactly how many times the deterministic function was called:

select
        v1.owner, ct1, ct2, ct1 * ct2, sum(ct1 * ct2) over() tot_ct
from
        (select owner, count(object_name) ct1 from tab1 group by owner) v1,
        (select owner, count(object_name) ct2 from tab2 group by owner) v2
where
        v2.owner = v1.owner
/

OWNER                  CT1        CT2    CT1*CT2     TOT_CT
--------------- ---------- ---------- ---------- ----------
SYS                    128      40104    5133312    5160466
SYSTEM                  65        417      27105    5160466
OUTLN                    7          7         49    5160466

3 rows selected.

The number of comparisons done by the filter() predicate 5,160,466: double it to get the number of function calls. For every single one of the 40,104 SYS rows in tab2 the function was called for every single one of the SYS rows in tab1, for both sides of the inequality.

It’s a shame that Oracle doesn’t calculate and project the “virtual columns” that will be used in the join predicates, because in my case that would have reduced the number of calls from 10 million to 40,232 – a factor of roughly 250. That would probably be worth a lot of CPU to Stefan.

Damage Limitation

For my silly little query that went from 0.39 seconds to 2.70 seconds you might decide there’s no point in trying to improve things – in fact many of the sites I’ve visited probably wouldn’t even notice the CPU wastage (on one call); but when the query runs for 2,970 seconds and a little fiddling around shows that it could run in 1,100 seconds you might be inclined to see if there’s something you could do improve things.

Andrew Sayer suggested the possibility of rewriting the query with a pair of CTEs (“with” subqueries) that were forced to materialize the trim() in the CTE. The cost of physically creating the two large GTTs might well be much less than the CPU spent on the trim()ed join.

Alternatively – and dependent on the ownership and quality of the application – you could write a check constraint on each table to ensure that the column value was always equal to the trim() of the column value.

A similar option would be to add an (invisible) column to each table and use a trigger to populate the column with the trimmed value and then use the trimmed column in the query.

Conclusion

I don’t think that anything I’ve done or described in this note could be called rocket science (or telescope science as, perhaps, it should be in honour of Webb); but it has shown how much insight you can gain into what Oracle is doing and how you may be able to pin-point excess work using a few simple mechanisms that have been around for more than 15 years.

June 17, 2022

Case Study

Filed under: Execution plans,Oracle,Parallel Execution,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 1:25 pm BST Jun 17,2022

The question “How do you trouble-shoot a performance problem” came up in an online session I did for the AIOUG (All-India OUG) today. It’s a very broad question and the only possible answers are either extremely generic, or very specific – so here’s a specific example that I talked about to give some idea of the processes I go through. It’s an example from the Oracle-l list server asking the (paraphrased) question:

I’ve got a parallel query that take 24 seconds to return 2,500 rows for 2018. .The same query for 2019 should return a similar amount of data but consumes a lot of TEMP space before failing; it takes 45 minutes to complete if I remove the parallel hint. The only difference between the two queries is a change to a single predicate: “opclf.year_number = to_number(‘YYYY’)” and the statistics are up to date – what’s going on?

The ease of diagnosing a problem is dependent on the ease of access to all the right information, and you don’t always know initially what the right information might be. In this case the OP had started by posting to github the SQL Monitor reports for the two queries, which were automatically generated since the queries were using parallel execution.

Click here to show/hide the first few sections for the 2019 output
SQL Text
------------------------------
SELECT /*+ PARALLEL(8) */ DISTINCT rd.document_id AS doc_id ,'firm_id' AS criteria_key ,opf.ultimate_parent_firm_id AS series_id ,period_number FROM ( SELECT /*+ PARALLEL(8) */ DISTINCT rd.document_id ,rd.client_role_id ,( CASE WHEN MONTHS_BETWEEN(TO_DATE('04/28/2022', 'MM/DD/YYYY'), TO_DATE('04/01/2017', 'MM/DD/YYYY')) > 12 THEN TRUNC(dc.date_value, 'YEAR') ELSE TRUNC(dc.date_value, 'MONTH') END ) period_number FROM REPORT_ENGINE.date_code dc ,REPORT_ENGINE.lit_fact_bd rd INNER JOIN
report_engine.firm FIRM ON rd.firm_id = FIRM.firm_id WHERE dc.date_value BETWEEN TO_DATE('04/01/2017', 'MM/DD/YYYY') AND TO_DATE('04/28/2022', 'MM/DD/YYYY') AND rd.publication_date_id = dc.date_id AND rd.year_number = to_number('2019') AND (FIRM.ultimate_parent_firm_id IN (to_number('252094'))) ) rd INNER JOIN report_engine.opposing_counsel op ON rd.client_role_id = op.legal_role_id INNER JOIN report_engine.lit_fact_bd opclf ON opclf.document_id = rd.document_id AND op.opposing_counsel_role_id
= opclf.client_role_id AND opclf.year_number = to_number('2019') INNER JOIN report_engine.firm opf ON opclf.firm_id = opf.firm_id AND opf.firm_id >= 1000 WHERE 1 = 1

Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  SYS (647:28741)
 SQL ID              :  3cjg20q2vw41f
 SQL Execution ID    :  16777216
 Execution Started   :  06/09/2022 05:08:24
 First Refresh Time  :  06/09/2022 05:08:25
 Last Refresh Time   :  06/09/2022 05:08:27
 Duration            :  221s
 Module/Action       :  sqlplus@c111dhw (TNS V1-V3)/-
 Service             :  SYS$USERS
 Program             :  sqlplus@c111dhw (TNS V1-V3)

Global Stats
====================================================================
| Elapsed |   Cpu   |    IO    |  Other   | Buffer | Write | Write |
| Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs  | Bytes |
====================================================================
|     222 |     179 |       28 |       15 |    49M | 11624 |   2GB |
====================================================================

Parallel Execution Details (DOP=8 , Servers Allocated=16)
=====================================================================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    |  Other   | Buffer | Write | Write |         Wait Events         |
|                |       |         | Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs  | Bytes |         (sample #)          |
=====================================================================================================================================
| PX Coordinator | QC    |         |    0.26 |    0.26 |          |     0.00 |  12144 |       |     . |                             |
| p000           | Set 1 |       1 |         |         |          |          |        |       |     . |                             |
| p001           | Set 1 |       2 |         |         |          |          |        |       |     . |                             |
| p002           | Set 1 |       3 |         |         |          |          |        |       |     . |                             |
| p003           | Set 1 |       4 |         |         |          |          |        |       |     . |                             |
| p004           | Set 1 |       5 |         |         |          |          |        |       |     . |                             |
| p005           | Set 1 |       6 |         |         |          |          |        |       |     . |                             |
| p006           | Set 1 |       7 |         |         |          |          |        |       |     . |                             |
| p007           | Set 1 |       8 |         |         |          |          |        |       |     . |                             |
| p008           | Set 2 |       1 |     220 |     177 |       28 |       15 |    48M | 11624 |   2GB | direct path write temp (28) |
| p009           | Set 2 |       2 |         |         |          |          |        |       |     . |                             |
| p010           | Set 2 |       3 |         |         |          |          |        |       |     . |                             |
| p011           | Set 2 |       4 |    1.71 |    1.70 |          |     0.01 |   595K |       |     . |                             |
| p012           | Set 2 |       5 |         |         |          |          |        |       |     . |                             |
| p013           | Set 2 |       6 |         |         |          |          |        |       |     . |                             |
| p014           | Set 2 |       7 |         |         |          |          |        |       |     . |                             |
| p015           | Set 2 |       8 |         |         |          |          |        |       |     . |                             |
=====================================================================================================================================

You’ll note that I haven’t got as far as the body of the execution plan yet, and I’ve highlighted line 44 – a line in the middle of the summary of activity for the parallel execution processes. There are 8 servers in each of two sets (we’re running parallel 8) and the line I’ve highlighted is the first server of the second set. The reason I’ve highlighted it is that it’s the one server that’s doing almost all the work – effectively the query (at some point in the plan) is running serially.

So, a first thought, maybe we’ve just been unlucky and running parallel 7 (say) would result in a better distribution of data across parallel servers and allow each of 7 processes to do whatever they had to do to a much smaller amount of data. Maybe a change to the distribution method (pq_distribute() hint) would spread the workload more evenly. In either case “N” smaller workload might still use a lot of TEMP, but possibly no individual process would fail, and the job would complete nearly N times as quickly.

Note: the 2018 Monitor report showed an equivalent skew in the data distribution, but the amount of work needed was much smaller with a read/write load of only 143MB compared to 2GB for the 2019 report. The OP did try running parallel 7, but with no change to the overall effect.

Let’s take a quick glance at the plan body (click to view/hide)
SQL Plan Monitoring Details (Plan Hash Value=1862006233)
=========================================================================================================================================================================================================================
| Id    |                     Operation                      |           Name            |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Write | Write | Mem | Temp | Activity |        Activity Detail       |
|       |                                                    |                           | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs  | Bytes |     |      |   (%)     |         (# samples)         |
=========================================================================================================================================================================================================================
|     0 | SELECT STATEMENT                                   |                           |         |      |           |        |     3 |          |       |       |     |      |   |                             |
|     1 |   PX COORDINATOR                                   |                           |         |      |           |        |     3 |          |       |       |     |      |   |                             |
|     2 |    PX SEND QC (RANDOM)                             | :TQ10003                  |     781 |  153 |           |        |       |          |       |       |     |      |   |                             |
|     3 |     HASH UNIQUE                                    |                           |     781 |  153 |           |        |       |          |       |       |     |      |   |                             |
|     4 |      PX RECEIVE                                    |                           |     781 |  151 |           |        |       |          |       |       |     |      |   |                             |
|     5 |       PX SEND HASH                                 | :TQ10002                  |     781 |  151 |           |        |     2 |          |       |       |     |      |   |                             |
|  -> 6 |        BUFFER SORT                                 |                           |     781 |  153 |       219 |     +3 |     2 |        0 | 11624 |   2GB | 61M |   2G |    26.21 | Cpu (26)                     |
|       |                                                    |                           |         |      |           |        |       |          |       |       |     |      |   | direct path write temp (28) |
|  -> 7 |         NESTED LOOPS                               |                           |     781 |  151 |       219 |     +3 |     2 |      39M |       |       |     |      |   |                             |
|  -> 8 |          NESTED LOOPS                              |                           |     781 |  151 |       219 |     +3 |     2 |      39M |       |       |     |      |     0.49 | Cpu (1)                      |
|  -> 9 |           NESTED LOOPS                             |                           |     777 |   44 |       219 |     +3 |     2 |      39M |       |       |     |      |   |                             |
| -> 10 |            NESTED LOOPS                            |                           |      41 |   26 |       219 |     +3 |     2 |     6463 |       |       |     |      |   |                             |
| -> 11 |             HASH JOIN                              |                           |      41 |   21 |       219 |     +3 |     2 |     6463 |       |       |  6M |      |   |                             |
|    12 |              BUFFER SORT                           |                           |         |      |         1 |     +3 |     2 |    36855 |       |       |     |      |   |                             |
|    13 |               PX RECEIVE                           |                           |      87 |   19 |         1 |     +3 |     2 |    36855 |       |       |     |      |   |                             |
|    14 |                PX SEND HASH                        | :TQ10000                  |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    15 |                 NESTED LOOPS                       |                           |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    16 |                  NESTED LOOPS                      |                           |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    17 |                   TABLE ACCESS BY INDEX ROWID      | FIRM                      |       1 |    2 |         1 |     +3 |     1 |       43 |       |       |     |      |   |                             |
|    18 |                    INDEX RANGE SCAN                | FIRM_ULT_PARENT_FIRM_IDX1 |       1 |    1 |         1 |     +3 |     1 |       43 |       |       |     |      |   |                             |
|    19 |                   PARTITION RANGE SINGLE           |                           |         |      |         1 |     +3 |    43 |    38694 |       |       |     |      |   |                             |
|    20 |                    BITMAP CONVERSION TO ROWIDS     |                           |         |      |         1 |     +3 |    43 |    38694 |       |       |     |      |   |                             |
|    21 |                     BITMAP INDEX SINGLE VALUE      | LIT_FACT_BD_IDX09         |         |      |         1 |     +3 |    43 |       49 |       |       |     |      |   |                             |
|    22 |                  TABLE ACCESS BY LOCAL INDEX ROWID | LIT_FACT_BD               |      63 |   19 |         3 |     +1 | 38694 |    38694 |       |       |     |      |     0.49 | Cpu (1)                      |
|    23 |              PX RECEIVE                            |                           |      20 |    2 |         1 |     +3 |     2 |        2 |       |       |     |      |   |                             |
|    24 |               PX SEND HASH                         | :TQ10001                  |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
|    25 |                PX BLOCK ITERATOR                   |                           |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
|    26 |                 TABLE ACCESS FULL                  | OPPOSING_COUNSEL          |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
| -> 27 |             TABLE ACCESS BY INDEX ROWID            | DATE_CODE                 |       1 |      |       219 |     +3 |  6465 |     6463 |       |       |     |      |   |                             |
| -> 28 |              INDEX UNIQUE SCAN                     | PK_DATE_CODE              |       1 |      |       219 |     +3 |  6465 |     6465 |       |       |     |      |   |                             |
| -> 29 |            PARTITION RANGE SINGLE                  |                           |      19 |      |       219 |     +3 |  6465 |      39M |       |       |     |      |   |                             |
| -> 30 |             TABLE ACCESS BY LOCAL INDEX ROWID      | LIT_FACT_BD               |      19 |      |       220 |     +2 |  6465 |      39M |       |       |     |      |    35.92 | Cpu (74)                     |
| -> 31 |              INDEX RANGE SCAN                      | LIT_FACT_BD_IDX20         |       1 |      |       219 |     +3 |  6465 |      39M |       |       |     |      |     9.22 | Cpu (19)                     |
| -> 32 |           INDEX UNIQUE SCAN                        | PK_FIRM                   |       1 |      |       219 |     +3 |   39M |      39M |       |       |     |      |    10.68 | Cpu (22)                     |
| -> 33 |          TABLE ACCESS BY INDEX ROWID               | FIRM                      |       1 |      |       219 |     +3 |   39M |      39M |       |       |     |      |    16.99 | Cpu (35)                     |
===================================================================================================================================================================

You can see from the “->” symbols at the left hand side of the plan that this report was generated while the plan was still running. The thing that leaps out as you glance down the page is the value in the “Rows (Actual)” column at operations 7-9 (which show the rowsources generated by some nested loop joins) and operations 29, 32 and 33 of the plan that tell us something about how those rowsources were generated.

Operation 29 has executed (Execs) 6,465 so far, producing a total of 39M rows, and operations 32 and 33 have both executed 39M times each producing a total of 39M rows by index unique scan.

The plan for the 2018 data was similar though the join order for DATE_CODE, LIT_FACT_BD and FIRM was different (and it was the join to LIT_FACT_BD that increased the row count dramatically – so hinting it to be the last table in the join might help a bit), but the largest rowcount for the 2018 query was only 3M rows, not the 39M that had appeared after only 6,465 rows of a possible driving 39,855 in the 2019 query.

So it’s not surprising that the query could take much longer for 2019. It’s not the volume of output that matters, it’s the volume of input (or, more accurately, throughput or intermediate) data that matters.

Let’s think about that volume, though: the 2018 plan generated 3M rows and then crunched them down to 2,500 rows and the 2019 plan was supposed to produce a similar sized output (from 39M+ rows). Could we rewrite the query in some way that made it do some intermediate aggregation so that the volume of data to be aggregated was never enormous?

Let’s take a look at the plan from the 2018 Monitor report (click to show/hide)
SQL Plan Monitoring Details (Plan Hash Value=472871521)
=======================================================================================================================================================================================================================================
| Id |                     Operation                      |           Name            |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  | Write | Write |  Mem  | Temp  | Activity |       Activity Detail       |
|    |                                                    |                           | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)     |        (# samples)         |
=======================================================================================================================================================================================================================================
|  0 | SELECT STATEMENT                                   |                           |         |      |         1 |    +24 |    17 |     2613 |      |       |       |       |       |       |   |                            |
|  1 |   PX COORDINATOR                                   |                           |         |      |         1 |    +24 |    17 |     2613 |      |       |       |       |       |       |   |                            |
|  2 |    PX SEND QC (RANDOM)                             | :TQ10003                  |       1 |   39 |         1 |    +24 |     8 |     2613 |      |       |       |       |       |       |   |                            |
|  3 |     HASH UNIQUE                                    |                           |       1 |   39 |         9 |    +16 |     8 |     2613 |      |       |       |       |    9M |       |     6.90 | Cpu (2)                     |
|  4 |      PX RECEIVE                                    |                           |       1 |   38 |         9 |    +16 |     8 |       3M |      |       |       |       |       |       |   |                            |
|  5 |       PX SEND HASH                                 | :TQ10002                  |       1 |   38 |        12 |    +14 |     8 |       3M |      |       |       |       |       |       |     3.45 | Cpu (1)                     |
|  6 |        BUFFER SORT                                 |                           |       1 |   39 |        23 |     +2 |     8 |       3M | 4584 | 143MB |   703 | 143MB |  151M |  151M |    34.48 | Cpu (2)                     |
|    |                                                    |                           |         |      |           |        |       |          |      |       |       |       |       |       |   | direct path read temp (6)  |
|    |                                                    |                           |         |      |           |        |       |          |      |       |       |       |       |       |   | direct path write temp (2) |
|  7 |         NESTED LOOPS                               |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
|  8 |          NESTED LOOPS                              |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
|  9 |           NESTED LOOPS                             |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
| 10 |            NESTED LOOPS                            |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |   |                            |
| 11 |             HASH JOIN                              |                           |      41 |   21 |        15 |     +2 |     8 |    19334 |      |       |       |       |    7M |       |   |                            |
| 12 |              BUFFER SORT                           |                           |         |      |        13 |     +2 |     8 |    19233 |      |       |       |       |    1M |       |   |                            |
| 13 |               PX RECEIVE                           |                           |      89 |   19 |        13 |     +2 |     8 |    19233 |      |       |       |       |       |       |   |                            |
| 14 |                PX SEND HASH                        | :TQ10000                  |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |   |                            |
| 15 |                 NESTED LOOPS                       |                           |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |   |                            |
| 16 |                  NESTED LOOPS                      |                           |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |   |                            |
| 17 |                   TABLE ACCESS BY INDEX ROWID      | FIRM                      |       1 |    2 |         1 |     +1 |     1 |       43 |      |       |       |       |       |       |   |                            |
| 18 |                    INDEX RANGE SCAN                | FIRM_ULT_PARENT_FIRM_IDX1 |       1 |    1 |         1 |     +1 |     1 |       43 |      |       |       |       |       |       |   |                            |
| 19 |                   PARTITION RANGE SINGLE           |                           |         |      |         1 |     +1 |    43 |    19233 |      |       |       |       |       |       |   |                            |
| 20 |                    BITMAP CONVERSION TO ROWIDS     |                           |         |      |         1 |     +1 |    43 |    19233 |      |       |       |       |       |       |   |                            |
| 21 |                     BITMAP INDEX SINGLE VALUE      | LIT_FACT_BD_IDX09         |         |      |         1 |     +1 |    43 |       51 |      |       |       |       |       |       |   |                            |
| 22 |                  TABLE ACCESS BY LOCAL INDEX ROWID | LIT_FACT_BD               |      64 |   19 |         1 |     +1 | 19233 |    19233 |      |       |       |       |       |       |   |                            |
| 23 |              PX RECEIVE                            |                           |      20 |    2 |        15 |     +2 |     8 |       20 |      |       |       |       |       |       |   |                            |
| 24 |               PX SEND HASH                         | :TQ10001                  |      20 |    2 |         1 |    +14 |     8 |       20 |      |       |       |       |       |       |   |                            |
| 25 |                PX BLOCK ITERATOR                   |                           |      20 |    2 |         1 |    +14 |     8 |       20 |      |       |       |       |       |       |   |                            |
| 26 |                 TABLE ACCESS FULL                  | OPPOSING_COUNSEL          |      20 |    2 |         1 |    +14 |     3 |       20 |      |       |       |       |       |       |   |                            |
| 27 |             PARTITION RANGE SINGLE                 |                           |       1 |      |        15 |     +2 | 19334 |       3M |      |       |       |       |       |       |   |                            |
| 28 |              TABLE ACCESS BY LOCAL INDEX ROWID     | LIT_FACT_BD               |       1 |      |        16 |     +1 | 19334 |       3M |      |       |       |       |       |       |    17.24 | Cpu (5)                     |
| 29 |               INDEX RANGE SCAN                     | LIT_FACT_BD_IDX20         |       1 |      |        15 |     +2 | 19334 |       3M |      |       |       |       |       |       |   |                            |
| 30 |            TABLE ACCESS BY INDEX ROWID             | DATE_CODE                 |       1 |      |        15 |     +2 |    3M |       3M |      |       |       |       |       |       |    10.34 | Cpu (3)                     |
| 31 |             INDEX UNIQUE SCAN                      | PK_DATE_CODE              |       1 |      |        16 |     +1 |    3M |       3M |      |       |       |       |       |       |     6.90 | Cpu (2)                     |
| 32 |           INDEX UNIQUE SCAN                        | PK_FIRM                   |       1 |      |        23 |     +2 |    3M |       3M |      |       |       |       |       |       |     6.90 | Cpu (2)                     |
| 33 |          TABLE ACCESS BY INDEX ROWID               | FIRM                      |       1 |      |        16 |     +1 |    3M |       3M |      |       |       |       |       |       |    13.79 | Cpu (4)                     |
=======================================================================================================================================================================================================================================

We see from operations 3 – 7 that the 3M rows generated from the nested loop joins pass up through a buffer sort operation before being crunched down to 2,613 rows. It’s probably the buffer sort (which is buffering but not sorting) that has mostly passed through a single server and spilled to disc in the 2019 report. We just don’t want that 39M+ rows ever to exist.

So how easy will it be to change the SQL (click to view/hide)
SELECT
        /*+ PARALLEL(8) */
        DISTINCT rd.document_id AS doc_id
        ,'firm_id' AS criteria_key
        ,opf.ultimate_parent_firm_id AS series_id
        ,period_number
FROM (
        SELECT
                /*+ PARALLEL(8) */
                DISTINCT rd.document_id
                ,rd.client_role_id
                ,(
                        CASE 
                                WHEN MONTHS_BETWEEN(TO_DATE('04/28/2022', 'MM/DD/YYYY'), TO_DATE('04/01/2017', 'MM/DD/YYYY')) > 12
                                        THEN TRUNC(dc.date_value, 'YEAR')
                                ELSE TRUNC(dc.date_value, 'MONTH')
                                END
                        ) period_number
        FROM REPORT_ENGINE.date_code dc
                ,REPORT_ENGINE.lit_fact_bd rd
        INNER JOIN report_engine.firm FIRM ON rd.firm_id = FIRM.firm_id
        WHERE dc.date_value BETWEEN TO_DATE('04/01/2017', 'MM/DD/YYYY')
                        AND TO_DATE('04/28/2022', 'MM/DD/YYYY')
                AND rd.publication_date_id = dc.date_id
                AND rd.year_number = to_number('2019')
                AND (FIRM.ultimate_parent_firm_id IN (to_number('252094')))
        ) rd
INNER JOIN report_engine.opposing_counsel op ON rd.client_role_id = op.legal_role_id
INNER JOIN report_engine.lit_fact_bd opclf ON opclf.document_id = rd.document_id
        AND op.opposing_counsel_role_id = opclf.client_role_id
        AND opclf.year_number = to_number('2019')
INNER JOIN report_engine.firm opf ON opclf.firm_id = opf.firm_id
        AND opf.firm_id >= 1000
WHERE 1 = 1;

Lines 7-10 and 27 tell us we alredy have an inline view where we’re doing a “select distinct” and, unwinding the mix of “Oracle” and “ANSI” syntax, we can see that it joins DATE_CODE, LIT_FACT_BD and FIRM, and we know that one of those tables explodes the intermediate data size to something enormous. So it looks like the original author of this code had already worked out that the query needed to aggregate early.

Checking back to the original plans we note that there’s only one “hash unique” operation, and there’s no sign of a “view” operation, so maybe the performance problem is a result of the optimizer suddenly deciding it can do complex view merging with this inline view, and perhaps all we need to do is add the hint /*+ no_merge */ to the inline view and see what happens.

Here’s the plan after adding the hint (click to hide/vew)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -----
| Id  | Operation                                            | Name                      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -----
|   0 | SELECT STATEMENT                                     |                           |       |       |   314 (100)|          |       |       |        |      | |
|   1 |  PX COORDINATOR                                      |                           |       |       |            |          |       |       |        |      | |
|   2 |   PX SEND QC (RANDOM)                                | :TQ10005                  |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | P->S | QC (RAN D)  |
|   3 |    HASH UNIQUE                                       |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | PCWP | |
|   4 |     PX RECEIVE                                       |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | PCWP | |
|   5 |      PX SEND HASH                                    | :TQ10004                  |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,04 | P->P | HASH |
|   6 |       HASH UNIQUE                                    |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|   7 |        NESTED LOOPS                                  |                           |    23 |  2599 |   313   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|   8 |         NESTED LOOPS                                 |                           |    23 |  2599 |   313   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|   9 |          NESTED LOOPS                                |                           |    23 |  2323 |   310   (1)| 00:00:06 |       |       |  Q1,04 | PCWP | |
|* 10 |           HASH JOIN                                  |                           |   388 | 21340 |   148   (1)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  11 |            PX RECEIVE                                |                           |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,04 | PCWP | |
|  12 |             PX SEND BROADCAST                        | :TQ10002                  |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | P->P | BROADCA ST  |
|  13 |              PX BLOCK ITERATOR                       |                           |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWC | |
|* 14 |               TABLE ACCESS FULL                      | OPPOSING_COUNSEL          |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWP | |
|  15 |            VIEW                                      |                           |   835 | 39245 |   146   (1)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  16 |             HASH UNIQUE                              |                           |   835 | 63460 |   146   (1)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  17 |              PX RECEIVE                              |                           |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,04 | PCWP | |
|  18 |               PX SEND HASH                           | :TQ10003                  |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,03 | P->P | HASH |
|* 19 |                HASH JOIN BUFFERED                    |                           |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,03 | PCWP | |
|  20 |                 BUFFER SORT                          |                           |       |       |            |          |       |       |  Q1,03 | PCWC | |
|  21 |                  PX RECEIVE                          |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |  Q1,03 | PCWP | |
|  22 |                   PX SEND HASH                       | :TQ10000                  |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        | S->P | HASH |
|  23 |                    NESTED LOOPS                      |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        |      | |
|  24 |                     NESTED LOOPS                     |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        |      | |
|  25 |                      TABLE ACCESS BY INDEX ROWID     | FIRM                      |     1 |    12 |     2   (0)| 00:00:01 |       |       |        |      | |
|* 26 |                       INDEX RANGE SCAN               | FIRM_ULT_PARENT_FIRM_IDX1 |     1 |       |     1   (0)| 00:00:01 |       |       |        |      | |
|  27 |                      PARTITION RANGE SINGLE          |                           |       |       |            |          |    30 |    30 |        |      | |
|  28 |                       BITMAP CONVERSION TO ROWIDS    |                           |       |       |            |          |       |       |        |      | |
|* 29 |                        BITMAP INDEX SINGLE VALUE     | LIT_FACT_BD_IDX09         |       |       |            |          |    30 |    30 |        |      | |
|* 30 |                     TABLE ACCESS BY LOCAL INDEX ROWID| LIT_FACT_BD               |   598 | 30498 |   136   (0)| 00:00:03 |    30 |    30 |        |      | |
|  31 |                 PX RECEIVE                           |                           |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,03 | PCWP | |
|  32 |                  PX SEND HASH                        | :TQ10001                  |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | P->P | HASH |
|  33 |                   PX BLOCK ITERATOR                  |                           |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | PCWC | |
|* 34 |                    TABLE ACCESS FULL                 | DATE_CODE                 |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | PCWP | |
|  35 |           PARTITION RANGE SINGLE                     |                           |     1 |    46 |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP | |
|* 36 |            TABLE ACCESS BY LOCAL INDEX ROWID         | LIT_FACT_BD               |     1 |    46 |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP | |
|* 37 |             INDEX RANGE SCAN                         | LIT_FACT_BD_IDX20         |     1 |       |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP | |
|* 38 |          INDEX UNIQUE SCAN                           | PK_FIRM                   |     1 |       |     0   (0)|          |       |       |  Q1,04 | PCWP | |
|  39 |         TABLE ACCESS BY INDEX ROWID                  | FIRM                      |     1 |    12 |     0   (0)|          |       |       |  Q1,04 | PCWP | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -----

Note particularly that operations 15 and 16 tell us that we’ve forced the optimizer into treating the inline view as a separate query block and we now have two aggregation steps, one inside the view, and another after joining FIRM (again) and LIT_FACT_BD (again) to the inline view.

Unfortunately the plan shown here is pulled from memory using dbms_xplan.display_cursor() after execution, so it include the various parallel executoin colums (TQ, IN-OUT, PQ Distrib), but doesn’t have the rowsource execution stats enabled so we can’t see the actual workload and data volume, but in the words of the OP: “adding no_merge hint did the trick and now the SQL is just executing fine”.

Summary

The steps for solving the performance problems of a specific SQL statement are very fluid. For a long-running or parallel statement the SQL Monitor report will automatically be created (though there are limits on the size of the plan that may disable the feature) and this is the easiest source of useful information, though you might also need to pull the execution plan from v$sql_plan to get details about parallel execution and partitioning at the same time.

If you’re not licensed for the diagnostic and performance packs, though, enabling SQL Trace to get the plan and waits gets you a lot of infomation, and querying (g)v$pq_tqstat immediately after running the query can fill in the parallel traffic details.

In this example the SQL Monitor report showed a highly skewed distribution, which might have been fixable by changing the PQ distribution, or even by simply changing the degree of parallelism.

Further examination of the report showed that the query generated an enormous rowsource which it then crunched down to a small result set. Comparing the 2018 and 2019 plans (which were not quite identical but were sufficiently similar to justify comparison) the same skew and explosion of rowsource were visible in both, though the data size involved in the 2018 plan made it seem that the plan was a “good” one which really it wasn’t.

The obvious target for tuning was to see if the explosion in volume could be reduced or eliminated by writing the query with some intermediate non-mergeable view(s), and it turned out that the query had been written with that intent in its original form but without a hint to block complex view merging. After adding the hint the performance was acceptable.

May 5, 2022

Lag/Lead slow

Filed under: Bugs,Oracle,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 10:05 am BST May 5,2022

This note is about a surprising performance difference between the lead() and lag() analytic functions (which31 turns out to be due to the behaviour of the nth_value() function) when the option to “ignore nulls” is included in their use (jump to conclusion). The detail I’ll be writing about was highlighted in a thread on the Oracle developer forum about a requirement to add a number of analytic columns to a dataset of 156 million rows using a statement of the following shape:

create table tb_target_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

You’ll notice that I’ve introduced a row_number(), and both a lead() and a lag() of a column called ed_ucc_dt. All three analytic columns use the same partitioning and ordering, though, so Oracle will only be doing one “window sort” in the execution plan. Part of the performance problem, of course, was that with 156M rows of a couple of dozen existing columns and adding a dozen new columns, the workload due to sorting was bound to be very large, so there were various suggestions of how to minimise that part of the workload.

However Solomon Yakobson pointed out that the code was using the “ignore nulls” option and there was a bug in 11g that made lead() and lag() very slow when this option was used. He subsequently reported that this defect was still present in 19c, restricted it to just the lag() function, and cited a MOS document ID referencing the problem: LAG Function is slow when using ignore nulls (Doc ID 2811596.1). The implication of the MOS note is that we shouldn’t expect this to change.

A follow-up posting by User_H3J7U gave us a reason for the slowness of the lag() function by running a sample query through dbms_utility.expand_sql(). Oracle rewrites the query to use variants of the nth_value() function when you use “ignore nulls”, but rewrites it to use variants of first_value() when you aren’t using the “ignore nulls” option. This isn’t a complete explanation of why lag() should be slow while lead() is not – but it’s a significant pointer towards a possible implementation issue and is a good clue about working around the problem. So let’s build a model of the situation.

The basic model

rem
rem     Script:         lag_ignore_nulls.sql
rem     Author:         Jonathan Lewis / Sam P
rem     Dated:          May 2022
rem     Purpose:     
rem
rem     Last tested
rem             19.11.0.0
rem

create table tb_source (
        pat_id,
        visit_date_dt,
        ed_ucc_dt
)
as
with generator as (
        select rownum id
        from    dual
        connect by
                level <= 1e4    --> comment to avoid wordpress format issue
)
select
        g1.id,
        to_date('01-Apr-2022') + dbms_random.value(0,100),
        to_date('01-Apr-2022') + dbms_random.value(5,105)
--      to_date(null)
from
        generator g1,
        generator g2
where
        g2.id <= 20     --> comment to avoid wordpress format issue
order by
        dbms_random.value
/


spool lag_ignore_nulls.lst

set serveroutput off
set timing on

prompt  ======================
prompt  Without "ignore nulls"
prompt  (My time 0.61 seconds)
prompt  ======================

create table tb_target_no_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

prompt  ======================
prompt  With "ignore nulls"
prompt  (My time 0.88 seconds)
prompt  ======================

create table tb_target_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

I’ve created a source table with 200,000 rows, consisting of 10,000 pat_id values, and 20 rows per pat_id. The 20 rows for a pat_id (probably) each have a different visit_date_dt and a different ed_ucc_dt.

After creating the data set I’ve created two more tables using the lead() and lag() functions to generate a previous (lag) and next (lead) ed_ucc_dt for each row, partitioning by pat_id, ordering by visit_date_dt. One statement includes the “ignore nulls” option the other doesn’t and, as you can see, the time to create the “no ignore” table was 0.61 seconds while the time to create the “ignore null” table was 0.88 seconds.

The variation isn’t dramatic – but this is just 200,000 rows, in memory, with only a few columns and only two columns added through lead and lag.

After the baseline test I tweaked the statement that created the table with the “ignore nulls” option to get three more times.

  • With neither lead() nor lag() the time was 0.29 seconds
  • With just the lead() column the time was 0.46 seconds – an increase of 0.17 seconds
  • With just the lag() column the time was 0.71 seconds – an increase of 0.42 seconds

You might note that 0.29 + 0.17 + 0.42 = 0.88 (the time I got for adding both columns) – it’s a little lucky that it looks like a perfect match, but even matching within a couple of hundredths of a second would be have been a nice detail. It certainly seems that lag() – with my test data – consumes more resources than lead() for a pair of operationd that look as if they should produce the same workloads.

Internal Rewrite

The next step was to check what the internal rewrite of the code looked like, so I passed the select part of the statements (the procedure won’t accepts “create as select”) through dbms_utility.expand_sql() and reformatted the results. Here are the two rewritten statements – first without “ignore nulls”:

select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        decode(
                count(*) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 preceding and 1 preceding
                        ),a
                 1,     first_value(a1.ed_ucc_dt) over (
                                partition by a1.pat_id order by a1.visit_date_dt
                                rows between 1 preceding and 1 preceding
                        ),
                        null
        ) prev_ed_ucc_dt,
        decode(
                count(*) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and 1 following
                        ),a
                 1,     first_value(a1.ed_ucc_dt) over (
                                partition by a1.pat_id order by a1.visit_date_dt
                                rows between 1 following and 1 following
                        ),
                        null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1


The code looks a little long and messy, but that’s mainly because just about everything it does happens twice. The lag() function (prev column) turns into a first_value() function that looks at the row preceding the current row in the partition (rows between 1 preceding and 1 preceding). However it first has to count over the same clause to see if a row exists, and then either report its value or report a null – hence the structure decode(count(), 1, first_value(), null)

Note: the full lag() function call is: “lag(expression, offset, default)” where the offset (number of rows to lag) defaults to 1 and the default is the value you want reported when there is no matching row, and defaults to null.

The call to lead() basically does the same thing, but uses (rows between 1 following and 1 following) to access the next row in the partition.

On the other hand this is the SQL that Oracle generates when we include the “ignore nulls” clause (which means Oracle can’t restrict the row range to just one preceding or following row):

select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        nvl(
                nth_value(a1.ed_ucc_dt, 1) from last ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt,
        nvl(
                nth_value(a1.ed_ucc_dt, 1)           ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and unbounded following
                ),
                null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1

Both lag() and lead() turn into nth_value() with a second parameter of 1 (i.e. nth == 1st … which makes you wonder why Oracle isn’t using first_value()), and we can also see the “ignore nulls” still being used.

The lag() call now uses the range (rows between unbounded preceding and 1 preceding) i.e. everything from the start of partition to the previous row, while the lead() call uses the range (rows between 1 following and unbounded following) i.e. from the next row to the end of partition.

The other important detail to note is that the translation of the lag() call also includes the clause “from last” – in other words we want the first row when reading the partition in reverse order, and that might have something to do with the extra time it takes to operate the (translated) lag() function.

Workaround

Oracle is using a generic nth_value() to translate a generic lead()/lag(), but we’re in the special case where we know n = 1, which means we (and Oracle) could use first_value()/last_value(). It’s perfectly reasonable for Oracle’s internal code to avoid special cases if it makes no difference to performance, of course, but maybe in this case we could imitate Oracle’s rewrite to get some benefit.

  • Step 1 – change nth_value() to the appropriate first/last.
  • Step 2 – get rid of the “from last” which won’t be needed with last_value()
  • Step 3 – move the “ignore nulls” to the spot that Oracle wants to see it with first/last

Here’s the resulting SQL – I’ve left the nvl(count, expression, null) in place, but if you wanted a null as the default return value for the original lead()/lag() calls you could simplify the code a little further.

create table tb_target 
as
select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        nvl(
                last_value(a1.ed_ucc_dt ignore nulls) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt,
        nvl(
                first_value(a1.ed_ucc_dt ignore nulls) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and unbounded following
                ),
                null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1
/

  • Run-time: 0.61 seconds.
  • Run-time with just last_value() / preceding: 0.47 seconds
  • Run time with just first_value() / following: 0.43 seconds

There still seems to be a little discrepancy between accessing to the preceding data compared to accessing the following data but there’s a much better balance than before.

One more edit – taking out the nvl() construct because the original lead()/lag() calls didn’t have a non-null default supplied:

create table tb_target 
as
select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
--
        last_value(a1.ed_ucc_dt ignore nulls) over (
                partition by a1.pat_id order by a1.visit_date_dt
                rows between unbounded preceding and 1 preceding
        ) prev_ed_ucc_dt,
--
        first_value(a1.ed_ucc_dt ignore nulls) over (
                partition by a1.pat_id order by a1.visit_date_dt
                rows between 1 following and unbounded following
        ) next_ed_ucc_dt,
--
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1
/

With this simplification the time dropped by a further couple of hundredths of a second hovering between 0.57 and 0.58 seconds.

There was one last detail about the test code that should be mentioned – do the changes in code still produce the same results? As a quick and dirty check I ran the following query after each variant of creating the tb_target table (tb_target_ignore is the table created using the original lead()/lag() code with “ignore nulls”):

select  *
from   (
        select * from tb_target minus select * from tb_target_ignore
        union all
        select * from tb_target_ignore minus select * from tb_target
        )
/

The result was always zero rows.

Hypothesis

I raised the idea that the difference in timing for the lead() and lag() functions might have something to do with the volume of data that Oracle could be processing to find the one row it needed.

My thinking was that for the lead() rewrite – the call to nth_value(ed_ucc_dt,1) – would simply be looking at the next row in the partition (if it existed) because my data has no nulls that neeed to be ignored, while the rewrite of the lag() function with its “from last” requirement could be making Oracle re-read the entire preceding section of the partition before starting to process it backwards.

As a possible check to see if this was a viable hypothesis I ran one more test – visible in the initial declaration of tb_source – I created the data with ed_ucc_dt set to null in every single row, so that Oracle would be forced to process from the current position to whichever end of the partition was relevant regardless of whether it was calling lead() or lag().

With this change in place the timing for the lead() only and lag() only statements were nearly identical – which is a weak support for the hypothesis.

And once I’d done that test the next obvious test was to see what happened if I increased size of each partition (using non-null values for ed_ucc_dt) to see if larger partitions would increase the difference between the forward and backward tests. To do this I changed the script to create the tb_source table to produce 5,000 pat_id value with 40 rows per pat_id by changing the where clause to:

where
        g2.id <= 40     --> comment to avoid wordpress format issue
and     g1.id <= 5e3    --> comment to avoid wordpress format issue

With this change in place the timings for the original form of the lead()/lag() statement were:

  • With both lead() and lag() in place the time was 1.05 seconds
  • With neither lead() nor lag() the time was 0.25 seconds
  • With just the lead() column the time was 0.41 seconds – an increase of 0.16 seconds
  • With just the lag() column the time was 0.98 seconds – an increase of 0.73 seconds

So the lag() time (ballpark figures) nearly doubles as the partition size doubles but the lead() time stays pretty much the same.

The results of these two tests do tend to suggest that the generic nth_value() implementation can do some short-circuiting when working “forwards”, using a mechanism that isn’t available when the “from last” clause requires it to work “backwards”.

Writing the previous paragraph prompted me to do one last test – it wouldn’t produce the same results, of course, but I ought to check the performance when I moved the “from last” clause out of the “prev”/lag() column expression into the “next”/lead() column expression in Oracle’s original translation to confirm that the problem was associated with the “from last” and not with the choice of “preceding” or “following” in the row range section of the over() clause. (It was the “from last” that made the difference.)

tl;dr

If you’re using the lag() or lead() functions with “ignore nulls” on a very large dataset you may find that you can rewrite the code with first_value() or last_value() calls that use less CPU. The benefit is probably only significant on fairly large data sets, and may be particularly noticeable for cases where the same over() clause is used many times and the partition sizes are more than a couple of dozen rows each.

The potential for excess CPU usage comes from the effect of a generic internal rewrite using the nth_value() function with the “from last” clause even for the special case where your lead()/lag() have an offset (i.e. n) of 1 which would allow for the use of first_value()/last_value().

To get the correct rewrite you can use dbms_utility.expand_sql() to generate a suitable statement from which you can extract and edit the relevant pieces of text.

Footnote

A further point raised by Solomon Jakobson was that in the generic case (where you can’t use first_value() / last_value() to rewrite the SQL because you want to use an offset greater than one) it’s worth noting that the following two expressions are effectively identical:

        nvl(
                nth_value(a1.ed_ucc_dt, 2) from last ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt

        nvl(
                nth_value(a1.ed_ucc_dt, 2)           ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt desc
                        rows between 1 following and unbounded following
                ),
                 null
        ) prev_ed_ucc_dt
/

Note particularly that to avoid the issue of scanning the data “from last”, I’ve changed the ordering (in line 11) to descending and I’ve changed the preceding of line 4 to following in line 12 while reversing the positions of 1 and unbounded. (The same strategy can be applied for the lag() function in the original code).

In general it’s likely that lead() is will be more efficient than lag() so this change could make a big difference in performance. It’s worth noting, however, that if you’ve been using both lead() and lag() as my example did then your execution plan will (truthfully) show two “window sort” operations and the second (i.e. lower number in the plan) window sort will also be sorting the columns added by the first window sort. Depending on the nature of your data the additional sort might increase the workload by more than the benefit you get from eliminating a lag().

In an example created by Solomon Jakobson the effect of using lag( ,5) was catastrophic and the change that introduced the extra sort to get a lead( ,5) made a huge difference (40 seconds down to 0.15 seconds – and I haven’t made a mistake in the decimal point there); in my original test case the improvement I got from applying the same strategy was small (0.88 seconds down to 0.86 seconds). The critical difference between the test cases was that one had a single partition of 20,000 rows, the other had (10,000) partitions of only 20 rows.

(Footnote to the footnote – it did cross my mind that if you had a performance catastrophe with the simple lag(,n) and with the extra sort from reversing the order to use a lead(,n) would Oracle do something clever if you found a way to rewrite the query with a cascade of inline-views that all used last_value() – but I decided I wasn’t going to look at that option until someone paid me to. More seriously I did wonder about the option for using match_recognize to bypass the lag() problem, and may get around to thinking about that more carefully at some point.)

April 7, 2022

Parallel Inactivity

Filed under: Oracle,Parallel Execution,Performance,Troubleshooting — Jonathan Lewis @ 5:08 pm BST Apr 7,2022

This is a simple script that I wrote more than 20 years ago (for 8.1.7.4 apparently) to get the session wait event figures from parallel query slaves as they were running. I’m posting it now because there’s a question on the Oracle Database Forum where it would be useful as a way for checking for performance problems in a long-running CTAS.

Oracle gives us the view v$px_session to link together the query co-ordinator with the parallel query slaves, and it gives us v$px_sesstat to report the linked session activity stats, but it doesn’t gives us a view to link together the session events for the multiple sessions, so this query simply joins v$px_session to v$session_events to produce that result:

rem
rem     Script:         px_waits.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2001
rem     Purpose:        Report events for PX execution
rem
rem     Last tested:
rem             19.3.0.0
rem
rem     Notes:
rem     Note that the co-ordinator can be on a different instance
rem     in parallel server systems.  To cater for an oddity of the
rem     qcinst report, we need the current instance number in case
rem     the QC is on the current machine.
rem
rem     If the co-ordinator is present, then it appears last on the list
rem     for a server group.
rem

column instance_number new_value m_inst
select instance_number from v$instance;

set linesize 120
set pagesize 60
set trimspool on
set tab off
set verify off

spool px_waits

break -
        on qcsid skip 1 -
        on server_group -
        on degree -
        on server_set -
        on sid -
        on server# skip 1

column event format a32

column  qcsid           format a6       heading 'Coord'
column  server_group    format 999      heading 'Grp'
column  degree          format a5       heading 'Deg'   noprint
column  server_set      format 999      heading 'Set'   noprint
column  server#         format 999      heading 'Sno'
column  sid             format 9999     heading 'SID'
column  name            format a32      heading 'Statistic'
column  value           format 99,999,999       heading 'value'

select
        ss.qcsid || '/' || nvl(ss.qcinst_id,&m_inst)            qcsid,
        ss.server_group,
        decode(degree,
                null,null,
                ss.degree || '/' || ss.req_degree
        )       degree,
        ss.server_set,
        ss.server#,
        ss.sid,
        se.event,
        se.total_waits,
        se.time_waited
from
        v$px_session            ss,
        v$session_event         se
where
        se.sid = ss.sid
order by
        ss.qcsid,
        ss.server_group,
        ss.server_set,
        ss.server#,
        se.event
;

clear columns
clear breaks
set verify on

spool off

See also: Parallel Activity for the session activity stats report.

Parallel Activity

Filed under: Oracle,Parallel Execution,Performance,Troubleshooting — Jonathan Lewis @ 4:49 pm BST Apr 7,2022

This is a simple script that I wrote more than 20 years ago (for 8.1.7.4 apparently) to get the session activity from parallel query slaves as they were running. I’m posting it now because there’s a question on the Oracle Database Forum where it would be useful as a way for checking for a skewed data distribution in a long-running query.

Oracle gives use the view v$px_sesstat which connects PX slaves with their query co-ordinator (QC) and reports the session activity for all of them. The view only report statistics numbersm though, so this little query joins the view to v$statname to report the names. It eliminates stats where the value is zero, and orders by QC (though you could add a line to restrict the query to a single QC), parallel server group, then process, then statistic number.

rem
rem     Script:         px_stats.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2001
rem     Purpose:        Report statistics for PX slaves on the fly
rem
rem     Last tested
rem             19.3.0.0
rem     Notes:
rem     Note that the co-ordinator can be on a different instance
rem     in parallel server systems.  To cater for an oddity of the
rem     qcinst report, we need the current instance number in case
rem     the QC is on the current machine.
rem
rem     If the co-ordinator is present, then it appears last on the list
rem     for a server group.
rem
rem     This code really needs to be enhanced to do a proper job on OPS/RAC
rem     by accessing gv$px_sesstat, and handling sids and instances correctly
rem
rem     The user running this query has to have SELECT privileges 
rem     on the views v$instance, v$statname and v$px_sesstat
rem

column instance_number new_value m_inst
select instance_number from v$instance;

spool px_stats

set tab off
set pagesize 60
set linesize 156
set trimspool on
set verify off

break -
        on qcsid skip 1 -
        on server_group -
        on degree -
        on server_set -
        on sid -
        on server# skip 1


column  qcsid           format a6               heading 'Coord'
column  server_group    format 999              heading 'Grp'
column  degree          format a5               heading 'Deg'
column  server_set      format 999              heading 'Set'
column  server#         format 999              heading 'Sno'
column  sid             format 9999             heading 'SID'
column  name            format a52              heading 'Statistic'
column  value           format 99,999,999,999   heading 'value'


select
        st.qcsid || '/' || nvl(st.qcinst_id,&m_inst)            qcsid,
        st.server_group,
        decode(degree,
                null,null,
                st.degree || '/' || st.req_degree
        )       degree,
        st.server_set,
        st.server#,
        st.sid,
        sn.name,
        st.value
from
        v$px_sesstat    st,
        v$statname      sn
where
        sn.statistic# = st.statistic#
and     st.value != 0
order by
        st.qcsid,
        st.server_group,
        st.server_set,
        st.server#,
        st.statistic#
;

clear columns
clear breaks
set verify on

spool off

See also: Parallel Inactivity for a simple script reporting wait events

February 21, 2022

CTE Enhancement

Filed under: 12c,Oracle,Performance,Transformations — Jonathan Lewis @ 9:24 am GMT Feb 21,2022

For many years I’ve told people that when you materialize a CTE (common table expression / “with” subquery) the result set will be written to the temporary tablespace using direct path writes and will be read back using cached reads. This stopped being an accurate description in 12c.

There is a clue about this in the way that the corresponding execution plans and I’ll be pointing that out later. The key difference between earlier versions of Oracle and newer versions is that the GTT (global temporary table) that holds the materialized result set is not necessarily written to disc, and may even avoid allocating a temporary segment.

I started looking at this feature a couple of days ago after remembering that someone, somewhere, had mentioned some details about a temporary object being kept in the PGA rather than being written to disc if the size wasn’t too big. I couldn’t remember if this was GTTs or temporary LOBs (or something completely different) and I only had a vague memory that there was a critical size that might have been 256KB; so I started experimenting with materializing CTEs.

Here’s the script I used to establish a baseline in 11g. I took a fairly arbitrary starting guess that if there was a PGA/Temp split is might be around 64KB.

rem
rem     Script:         cte_writes.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2022
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem             12.2.0.1
rem             11.2.0.4
rem

alter session set events '10046 trace name context forever, level 8';

prompt  ==============================
prompt  First sample - just under 64KB
prompt  ==============================

set arraysize  35
set pagesize  35

set pause Waiting...
set pause on

execute snap_ts.start_snap
execute snap_my_stats.start_snap

with g1 as (
        select  /*+ materialize */
                lpad(rownum,1024)       v1
        from
                dual
        connect by
                level <= 63
)
select
        substr(v1,-10)
from
        g1
;

alter session set events '10046 trace name context off';

execute snap_my_stats.end_snap
execute snap_ts.end_snap

prompt  ==============================
prompt  Second sample - just over 64KB
prompt  ==============================

execute snap_ts.start_snap
execute snap_my_stats.start_snap

with g1 as (
        select  /*+ materialize */
                lpad(rownum,1024)       v1
        from
                dual
        connect by
                level <= 64
)
select
        substr(v1,-10)
from
        g1
;

execute snap_my_stats.end_snap
execute snap_ts.end_snap

set pause off

The calls to the package snap_my_stats are the usual ones I use (very old source at this Wayback URL) to report the session’s activity (v$mystat) between start and end snapshot and the calls in the snap_ts package do the same for the I/O per tablespace, summing bu tablespace across v$filestat and v$tempstat.

This script basically materializes and reports a GTT with a single column of 1,024 bytes, and in this case either 63 or 64 rows. You’ll note that I’ve built another assumption into the code that the CTE (if kept in memory) won’t follow all the details of a “real” table block, but will be a simple array with a tiny overhead per row – I’ve also assumed that the optimizer isn’t smart enough (or foolhardy enough) to push the substr() call inside the CTE.

I’ve set pause on and set both the arraysize and pagesize to a value less than the total number of rows I’ll be fetching so that I can check a couple of dynamic performance views (in particular v$sort_usage) from another session while the queries are running.

As a starting point, here are some critical values I’ve selected from the various outputs for just the 63 row case when running 11.2.04:

-------------
Session stats
-------------
Name                                                 Value
----                                                 -----
physical reads                                           9
physical reads cache                                     9
physical writes                                          9
physical writes direct                                   9
physical writes direct temporary tablespace              9

---------
T/S Stats 
---------
TS#        Reads      Blocks   Avg Csecs    Max      Writes      Blocks   Avg Csecs    Max Tablespace
----       -----      ------   ---------    ---      ------      ------   ---------    --- -------------------
   3           1           9        .000      1           1           9        .000      0 TEMP


Since I've enabled extended tracing at level 8 (waits) I can also show you the I/O waits reported for the reads and writes:
WAIT #140185232180256: nam='direct path write temp' ela= 180 file number=201 first dba=35073 block cnt=9 obj#=-1 tim=1645178184267768
...
WAIT #140185232180256: nam='db file scattered read' ela= 183 file#=201 block#=35073 blocks=9 obj#=-40016369 tim=1645178184268342

A dump of the 9 blocks of the temporary file (the “file number-201” translates to tempfile 1 since I have db_files = 200) starting at block 35073 looks like an ordinary table with pctfree 10 (which is one of the annoying details of GTTs – you can’t adjust it), 3 ITL slots (which is normal for CTAS) and 7 rows per block.

So, for 11g, we conclude that the effect of materializing a CTE is simply to create a GTT in the temporary tablespace, write it out using direct path writes, then read it back into the buffer cache using db file scattered reads. (You might want to confirm that this always happens, even if the CTE holds only one row.)

If you take advantage of the pause to issue “alter system flush buffer_cache” from another session you can also dump the segment header block (35072 in my case) to see that it’s a normal table segment header block – using freelist management, not ASSM because that’s the way temporary tablespaces have to be declared. The segment header block didn’t get written to disc in the normal course of the test.

12c Enhancement

This is the moment where the second query, and the pause that allows me to query v$sort_usage, becomes significant. When I started 12.2.0.1 with the 63 row query I saw:

  • No I/O on the temporary tablespace
  • No entry in v$sort_usage

To my great satisfaction the 64 row query did report I/O to the temporary tablespace (10 blocks this time – needing one extra block to cater for the 64th row) with v$sort_usage reporting a segment being created on my behalf. Obviously I re-ran the test a couple of times, flushing the buffer cache and shared pool, and connecting to a new session each time. The results were totally consistent: 63 rows => no GTT, 64 rows => GTT.

If you’re feeling a little suspicious at this point, bear with me.

This is the point where I switched to 19.11.0.0 – and both queries ran in memory with no sign of a GTT being created. Luckily I had cloned the query several times in the script generating different pairs of numbers of rows: 127/128, 255/256, 511/512, 1023/1024, and when I hit 1024 (and 1023) my session produced a GTT.

Somewhere between 512 and 1023 rows I was hitting a critical breakpoint – so I nearly started working through a binary chop to find the actually breakpoint; luckily, though, I had a little inspiration: if the overhead per row was 3 bytes (as it would be for a normal table column of more than 254 bytes) then 1023 rows would have an overhead of about 3KB – so I should test 1021 rows if I wanted to test a memory of just under 1MB.

Sure enough, at 1021 rows the GTT didn’t appear, at 1022 rows it did – time after time after time.

But …

My tests seemed to be totally repeatable. Again, I connected to a new session a few times, I flushed the buffer cache, I flushed the shared pool, I checked v$sort_usage. Consistently the results seemed to say:

  • 12.2 uses the PGA up to 64KB then dumps to a GTT
  • 19.11.0.0 uses the PGA up to 1MB then dumps to a GTT

Except – that night I had to shut down the two virtual machines because sometimes, for no obvious reason, I can’t hibernate my laptop while the VMs are running; and when I started everything up again the following morning and started re-running the tests as I wrote up the results something had changed. In fact my 19.11 instance didn’t dump to a GTT until I had reached nearly 10MB of data and my 12.2 wasn’t even dumping at 1MB; and I have no idea why a complete restart made such a difference.

After spending a little time trying to figure out what had changed – and I think it may have been that I’d been running the previous day’s tests after doing a lot of heavy work with temporary LOBs trying to pin down an anomaly with the handling of the temporary tablespace – I finally tried a google search using keywords that might be relevant and found this article that Keith Laker wrote about 5 years ago.

The feature is known as In-memory “cursor-duration” temporary table. I mentioned a clue in the execution plans at the start of this note: materialization shows up with a “temp table transformation” operation followed, in 11g, by with a child operation of “load as select”; but in 12.2 the child operation is “load as select (cursor duration memory)”. I really should have started my invesigation by putting the entire text of that operation into a search engine.

Summary

(Basically the highlights from Keith’s article):

  • The “in-memory cursor-duration”temporary table” change appeared in 12.2
  • It can be used in a number of transformations that the optimizer does
  • It’s not possible to force the use of the feature for a given query, it’s down to an internal algorithm
  • The mechanism uses memory that is “essentially” PGA
  • Despite the name this feature does not require you to licence the In-Memory option
  • If you’re still using an older version of Oracle this could be a good reason for upgrading as it can reduce the I/O load particularly for “analytic” types of query at a cost of using extra memory.

All the work I had done trying to find a break-point where Oracle switched from using PGA to using a GTT had been a waste of time – and the apparently consistent results on the first day had been an “accident” dictated (possibly) by some PGA-related activity that had taken place before I started running my tests .

Footnotes and geeky things

Five years on from the publication date of Keith’s article we shouldn’t be surprised to see some changes. Keith notes that the mechanism will apply only to serial queries that do more than one pass over the table – but there are two points to raise there:

  • possibly the two-pass thing is because it usually takes two passes over a CTE before Oracle will materialize a CTE automatically; my example shows the in-memory effect on a single pass – but that was a forced materialization.
  • the restrictions on parallelism may have been relaxed by now – if you check for hidden parameters you will find: _in_memory_cdt_maxpx, default value 4, described as “Max Parallelizers allowed for IMCDT”.

Interestingly there are more “cdt” parameters in 12.2 than there are in 19.11, and there are clear indications of some changes in algorithm and mechanism:

12c parameters like '%cdt%
Parameter                                  System Value         Description
--------------------------------------------------------- -------------------- ---------------------------------
_cdt_shared_memory_limit                   0                    Shared Memory limit for CDT
_cdt_shared_memory_query_percent           20                   Per Query Percentage from CDT pool
_imcdt_use_mga                             ON                   MGA for In Memory CDT
_in_memory_cdt                             LIMITED              In Memory CDT
_in_memory_cdt_maxpx                       4                    Max Parallelizers allowed for IMCDT

19g parameters like '%cdt%'
Parameter                                  System Value         Description
--------------------------------------------------------- -------------------- ---------------------------------
_hcs_enable_in_mem_cdt_hint                FALSE                add hint opt_param('_in_memory_cdt', 'off')
_in_memory_cdt                             LIMITED              In Memory CDT
_in_memory_cdt_maxpx                       4                    Max Parallelizers allowed for IMCDT

The parameter “_in_memory_cdt” can take the values ON, LIMITED, or OFF – which tells you that even if you can’t force a query to use in-memory CDTs you can (if you really want to) stop a query from using the feature. There are a few notes about this parameter and its significance to RAC and parallel execution (for 12.2) on MOS – if you have an account – Doc ID 2388236.1 What is _in_memory_cdt Parameter?

The reference to MGA (the “managed global area”) in 12.2 is also quite interesting. This is documented as a Solaris feature using OSM to share memory between processes. For more general details you can review MOS Doc ID 2638904.1 MGA (Managed Global Area) Reference Note (again, only if you have an account).

The “new” oradebug mechanism shows (from 18c) a couple of relevant components under SQL compilation and execution that you could trace if you want to investigate further.

Components in library RDBMS:
--------------------------
  SQL_Compiler                 SQL Compiler ((null))
    ICDT_Compile               In Memory CDT Compilation (qks3t)
  SQL_Execution                SQL Execution (qer, qes, kx, qee)
    ICDT_Exec                  In Memory CDT Execution (qes3t, kxtt)

January 19, 2022

Hash Aggregation – 2

Filed under: Execution plans,Infrastructure,Oracle,Performance,Tuning — Jonathan Lewis @ 12:03 pm GMT Jan 19,2022

In the note I wrote a couple of days ago about the way the optimizer switches from hash group by to sort group by if you add an order by X,Y clause to a query that has a group by X,Y I mentioned that this had been something I’d noticed about 15 years ago (shortly after Oracle introduced hash aggregation, in fact) but it was only the second of two details I’d noted when experimenting with this new operation. The first thing I’d noticed came from an even simpler example, and here’s a (cryptic) clue to what it was:


column operation_type format a24

select
        operation_type, count(*)
from
        V$sql_workarea
group by
        operation_type
order by
        operation_type
;

OPERATION_TYPE             COUNT(*)
------------------------ ----------
APPROXIMATE NDV                   1
BUFFER                          130
CONNECT-BY (SORT)                10
GROUP BY (HASH)                  12
GROUP BY (SORT)                 240
HASH-JOIN                       466
IDX MAINTENANCE (SORT)           39
LOAD WRITE BUFFERS               10
RANGE TQ COMPUTATION             13
RESULT CACHE                      4
SORT (v1)                        10
SORT (v2)                       147
WINDOW (SORT)                    35

The clue isn’t in the query, it’s in what’s missing from the result set, so here’s some code to create and query some data to make it easier to spot the anomaly:

rem
rem     Script:         hash_agg.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sept 2007
rem

create table t1
as
with generator as (
        select  rownum id
        from    dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        lpad(mod(rownum-1,1000),6)      small_vc_K,
        lpad(rownum-1,6)                small_vc_M
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid wordpress format issue
;

set serveroutput off

prompt  ===========================
prompt  1000 distinct values (hash)
prompt  ===========================

select
        /*+ gather_plan_statistics 1000 */
        count(*)
from
        (
        select  /*+ no_merge */
                distinct small_vc_K
        from
                t1
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  ===========================
prompt  1000 distinct values (sort)
prompt  ===========================

select
        /*+ gather_plan_statistics 1000 */
        count(*)
from
        (
        select  /*+ no_merge no_use_hash_aggregation */
                distinct small_vc_K
        from
                t1
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

I’ve added the /*+ gather_plan_statistics */ hint to the query so that I can check on the rowsource execution stats and (particularly) the memory and/or temporary space used; and I’ve blocked hash aggregation in one of the two queries, so I expect to see a “hash unique” operation in the first query and a “sort unique” operation in the second. Here’s what I get from 19.11.0.0:


===========================
1000 distinct values (hash)
===========================

  COUNT(*)
----------
      1000

SQL_ID  1baadqgv02r6b, child number 0
-------------------------------------
select  /*+ gather_plan_statistics 1000 */  count(*) from  (  select
/*+ no_merge */   distinct small_vc_K  from   t1  )

Plan hash value: 171694178

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |   765 (100)|      1 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   1 |  SORT AGGREGATE      |      |      1 |      1 |            |      1 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   2 |   VIEW               |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   3 |    HASH UNIQUE       |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.07 |    2637 |   2632 |  1889K|  1889K| 1417K (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |   1000K|   405  (17)|   1000K|00:00:00.04 |    2637 |   2632 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

17 rows selected.

===========================
1000 distinct values (sort)
===========================

  COUNT(*)
----------
      1000

SQL_ID  a66rqhgw7a6dk, child number 0
-------------------------------------
select  /*+ gather_plan_statistics 1000 */  count(*) from  (  select
/*+ no_merge no_use_hash_aggregation */   distinct small_vc_K  from
t1  )

Plan hash value: 1750119335

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |   765 (100)|      1 |00:00:00.22 |    2637 |       |       |          |
|   1 |  SORT AGGREGATE      |      |      1 |      1 |            |      1 |00:00:00.22 |    2637 |       |       |          |
|   2 |   VIEW               |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.22 |    2637 |       |       |          |
|   3 |    SORT UNIQUE       |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.22 |    2637 | 48128 | 48128 |43008  (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |   1000K|   405  (17)|   1000K|00:00:00.02 |    2637 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

Look back at the summary of v$sql_workarea. Can you now spot something that might be missing? Since we’ve now got two execution plans with their sql_ids, let’s run a much more precise query against the view.

select
        sql_id, child_number, operation_id, operation_type ,
        total_executions, last_memory_used/1024 last_mem_kb
from
        v$sql_workarea
where
        sql_id in (
                '1baadqgv02r6b',
                'a66rqhgw7a6dk'
)
order by
        sql_id, child_number, operation_id
/

SQL_ID        CHILD_NUMBER OPERATION_ID OPERATION_TYPE       TOTAL_EXECUTIONS LAST_MEM_KB
------------- ------------ ------------ -------------------- ---------------- -----------
1baadqgv02r6b            0            3 GROUP BY (HASH)                     1       1,417
a66rqhgw7a6dk            0            3 GROUP BY (SORT)                     1          42

The first thing to note, of course, is that the sort aggregate at operation 1 doesn’t report a workarea at all; but this shouldn’t be a surprise, the operation is simply counting rows as they arrive, there’s no real sorting going on.

Operation 3 in both cases is the more interesting one. In the plan it’s reported as a “hash/sort unique” but the workarea operation has changed this to a “group by (hash/sort)”.  It’s a little detail that probably won’t matter to most people most of the time – but it’s the sort of thing that can waste a couple of hours of time when you’re trying to analyze a performance oddity.

And another thing …

You’ll have noticed, by the way, that the hash unique operation actually demanded far more memory than the sort unique operation; 1.4MB compared to 42KB. In a large production system this may be fairly irrelevant, especially since the discrepancy tends to disappear for higher volumes, and hardly matters if the operation spills to disc. However, if you have a large number of processes doing a lot of small queries using hash aggregation you may find that cumulatively they use up an unexpected fraction of your pga_aggregate_target.

There is some scope (with the usual “confirm with Oracle support” caveat) for modifying this behaviour with a fix_control:

16792882    QKSFM_COMPILATION_16792882  Cardinality threshold for hash aggregation(raised to power of 10)

alter session set "_fix_control"='16792882:3';

The default value is 0, the maximum value seems to be 3 and the description about raising to power of 10 seems to be the wrong way round, but I haven’t done any careful testing. When I set the value to 3 the hash unique with an estimated output of 1,000 switched to a sort unique.(hence my suggestion about 10N rather than N10).

Footnote

In the previous note I pointed out that Oracle would use a single sort group by operation rather than a consecutive hash group by / sort order by pair of operations. This simpler example helps to explain why. If you check the CBO trace files for the two you can check the costs of the inline aggregate view.

From a 19.11.0.0 trace file here are two costing fragments for the hash group by strategy; the first bit is the cost of the tablescan that acquires the rows (operation 4) the second is the final cost of the inline view / query block:


  Best:: AccessPath: TableScan
         Cost: 404.639881  Degree: 1  Resp: 404.639881  Card: 1000000.000000  Bytes: 0.000000

Final cost for query block SEL$2 (#2) - All Rows Plan:
  Best join order: 1
  Cost: 764.843155  Degree: 1  Card: 1000000.000000  Bytes: 7000000.000000
  Resc: 764.843155  Resc_io: 337.000000  Resc_cpu: 1069607888
  Resp: 764.843155  Resp_io: 337.000000  Resc_cpu: 1069607888

I won’t show you the equivalent extracts for the example where I blocked hash aggregation because the numbers are identical. So there’s no way that Oracle will want to do hash group by followed by sort order by, when it can do just a single sort group by that costs exactly the same as the hash group by operation alone. This is a shame, and a little ironic because when Oracle Corp introduce hash aggregation they made a bit of a fuss about how much more efficient it was than sorting- but it looks like no-one told the optimizer about this.

January 17, 2022

Hash Aggregation – 1

Filed under: CBO,Hints,Oracle,Performance,Tuning — Jonathan Lewis @ 1:15 pm GMT Jan 17,2022

Here’s an observation I made some time in 2007, but didn’t mention online until a (possibly) relevant question appeared on the Oracle database forum in 2017; and the topic reappeared in a nearly unrelated question a little while ago. The 2017 question was this:

I am running following SQL, which it taking almost 4-5 seconds and returning only 18 records.

SELECT
   SUM(TOTAL_AMOUNT) C1,
   a.LEVEL2_ENAME AS c2,
   c.CURR_YEARMO AS c3
FROM TERRITORY_PRESET_MSNP a,
   CV_RESTATED_MSNP b
   LEFT OUTER JOIN
   MONTH_D c
   ON b.YEARMO = c.CURR_YEARMO,
   PRODUCT_OFFERING d
WHERE   b.PO_ID = d.ROW_ID
    AND b.DATASOURCE_ID = 10
    AND b.YEARMO = 201704
    AND b.OWNER_TERR_ID = a.TERR_ID
    AND c.CURR_YEARMO = 201704
    AND a.YEARMO = 201706
GROUP BY c.CURR_YEARMO, a.LEVEL2_ENAME
ORDER BY C3, C2;

If I remove the ORDER BY clause it is returning results in 1 second.

Before saying anything else, I’ll just make a couple of points about the SQL:

  • It’s not a good idea to mix traditional Oracle syntax with “ANSI” syntax – it’s likely to make things harder for the next person to read the code and there’s just a slight possibility that the rewrite that Oracle applies to hide the ANSI syntax may block some of the possible execution paths.
  • The C3, C2 in the order by clause are the column aliases for the curr_yearno, level2_ename columns used in the group by clause.  Although Oracle allows you to use aliases in the order by (but not in the group by) doing so can only make the SQL a little harder to interpret (especially in a case like this when you have both clauses).
  • There’s a left outer join to month_d (aliased as c), but the where clause then includes the predicate c.CURR_YEARMO = 201704 which will eliminate any rows where curr_yearmo is null, thus converting (we hope – but the mix and match syntax might introduce a side-effect) the outer join to an inner join – so maybe that’s a design error in the SQL.

Addressing the question, though, the first thought (rapidly eliminated) is that perhaps this is the standard complaint of the GUI interface: “it’s fast until I add an order by clause”.

The commonest reason for this complaint is that the typical GUI interface shows you the first few rows and waits for you to page down, so your impression of the response time is “the time to see the first few rows” rather than “the time to get the complete result set” when it might take much more time to return the entire result set. When you add an order by clause it’s possible that Oracle will have to sort the entire result set before you see any of it. It’s often the difference between “first few rows” and “entire result set” that triggers the complaint.

In this case the “rapid elimination” of this thought is due to the OP saying the result set was only 18 rows. which is likely to produce the gut feeling that it shouldn’t take very long for Oracle to sort 18 rows if it had to find all of them before showing displaying them. On the other hand the thought might need a little follow-up, after all:

  • it’s possible that the GUI is only displaying 15 rows at a time and it’s takes a lot of time to find the extra 3 rows. Just think of a tablescan with a filter subquery when the rows you want are the first few in the table. Without an order by the rows can be displayed as they are found, with an order by Oracle will have to get to the end of the tablescan before the rows can be sorted and displayed.
  • the optimizer can produce terrible estimates and the order by clause might prompt it to say “if I start with a different table, driving through a specific index, and changing the join order then I won’t have to do any sorting for the order by clause” The resulting path may be a very bad idea if the arithmetic produces the wrong results.

The OP hasn’t shown us the execution plan – and that’s what we really need to see; but there is an alternative  guess that we could make about what the optimizer is doing that would affect the performance this much.

The query is an aggregate query – we have a group by. Since 10g Oracle has been able to use “hash aggregation” – which shows up as the HASH GROUP BY operation in an execution plan. Here’s a little demo script, with a couple of sample queries:

rem
rem     Script:         sort_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2012
rem 

create table t1
as
with generator as (
        select
                rownum id 
        from dual 
        connect by 
                level <= 1e4  -- > comment to avoid wordpress format issue
)
select
        trunc(dbms_random.value(0,262144))      n_256K,
        trunc(dbms_random.value(0,131072))      n_128K,
        trunc(dbms_random.value(0,8192))        n_8k
from
        generator       v1,
        generator       v2
where
        rownum <= 8 * 1048576  -- > comment to avoid wordpress format issue
;

set arraysize 1000
set timing on
set autotrace traceonly 

prompt  ===========
prompt  No Order by
prompt  ===========

select
        n_8K, count(*) ct
from
        t1
group by
        n_8k
;


prompt  =============
prompt  With Order by
prompt  =============

select
        n_8K, count(*) ct
from
        t1
group by
        n_8k
order by
        1
;


My table has 8M rows, and my queries target the column with 8K distinct values. I’ve enabled autotrace from SQL*Plus, set a large arraysize (to reduce time lost to SQL*Net round-trips), and set timing on so we can get an elapsed time for total execution. I’ve set autotrace to “traceonly” so that the SQL*Plus client will fetch the data but won’t doesn’t waste resources formatting it, but I’m not actually interested in the handful of execution statistics that will be reported.

Here are the two sets of results from a test run on 19.11.0.0. Note, particularly, the figure for Elapsed:


===========
No Order by
===========

8192 rows selected.

Elapsed: 00:00:00.58

Execution Plan
----------------------------------------------------------
Plan hash value: 136660032

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
---------------------------------------------------------------------------

=============
With Order by
=============

8192 rows selected.

Elapsed: 00:00:03.03

Execution Plan
----------------------------------------------------------
Plan hash value: 3946799371

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  SORT GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
---------------------------------------------------------------------------

The time has jumped from slightly under 0.6 seconds to just over 3 seconds as the critical operation changes from a HASH GROUP BY to a SORT GROUP BY (even though the estimated cost, hence predicted run-time, of execution has not changed).

Your first thought at this point is probably along the lines of “surely it doesn’t take 2.4 seconds to sort 8,192 small rows, why doesn’t Oracle do a hash group by followed by a sort order by?” The answer seems to be “it just doesn’t”. So here’s one way to make it happen (with execution plan and elapsed time from 19.11.0.0 again):

select
        dist_8k, ct
from
        (
        select  /*+ no_merge */
                n_8K dist_8k, count(*) ct
        from
                t1
        group by
                n_8k
        )
order by 
        dist_8k
;

8192 rows selected.

Elapsed: 00:00:00.59

Execution Plan
----------------------------------------------------------
Plan hash value: 1705136228

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   1 |  SORT ORDER BY       |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   2 |   VIEW               |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   3 |    HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
-----------------------------------------------------------------------------


Again the estimated cost of execution doesn’t (to the nearest whole number) change from the basic cost of the hash aggregation – but we have brought the time back down to just under 0.6 seconds.

It’s worth reminding you at this point that if you can re-engineer any SQL that’s performing badly and can see that the driving core of the query can be reduced to something much simpler and more efficient, then wrapping that core into an inline view with the /*+ no_merge */ hint (and possibly putting it up into a “with subquery” clause) might be the safest first step and most effective way of improving performance.

There is an option for avoiding the query rewrite here – hint the path you want to see:


select  /*+ use_hash_aggregation */
        n_8K, count(*) ct
from
        t1
group by
        n_8k
order by 
        1
;

8192 rows selected.

Elapsed: 00:00:00.59

Execution Plan
----------------------------------------------------------
Plan hash value: 2808104874

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  SORT ORDER BY      |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
----------------------------------------------------------------------------

The nice thing about this, of course, is that you don’t actually have to edit the text; the hint could be attached to the query through an SQL Patch (or by abusing the SQL Profile or SQL Plan Baseline mechanisms).

The difficult part of hinting is finding the correct query block name for a more complex query. I simply added the hint /*+ use_hash_aggregation */ but the hint can be aimed at a query block so, in the absence of explicit query block names I could have used the hint /*+ use_hash_aggregation(@sel$1) */ using the default naming.

In a more complex case you can find the appropriate query block name by using the ‘alias’ format option when generating the execution plan. Consider the following query (where t2 and t3 are created from view all_objects), with its initial execution plan:


explain plan for
select 
        t2.owner, count(*)
from 
        t2 
where 
        t2.object_id in (
                select  t3.object_id 
                from    t3 
                where   t3.object_type = 'TABLE'
        )
group by 
        t2.owner
order by 
        t2.owner
/


select * from table(dbms_xplan.display(format=>'alias'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2646727453

------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |    23 |   575 |   319   (8)| 00:00:01 |
|   1 |  SORT GROUP BY        |      |    23 |   575 |   319   (8)| 00:00:01 |
|*  2 |   HASH JOIN RIGHT SEMI|      |  2298 | 57450 |   317   (7)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL  | T3   |  2298 | 34470 |   158   (7)| 00:00:01 |
|   4 |    TABLE ACCESS FULL  | T2   | 57448 |   561K|   156   (6)| 00:00:01 |
------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$5DA710D3
   3 - SEL$5DA710D3 / T3@SEL$2
   4 - SEL$5DA710D3 / T2@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T2"."OBJECT_ID"="T3"."OBJECT_ID")
   3 - filter("T3"."OBJECT_TYPE"='TABLE')

The Query Block Name / Object Alias information tells us that the query block holding the sort group by at operation 1 is named SEL$5DA710D3 so we can use that as the target query block in the hint: /*+ use_hash_aggregation(@SEL$5DA710D3) */ and the plan changes to:


-------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |    23 |   575 |   319   (8)| 00:00:01 |
|   1 |  SORT ORDER BY         |      |    23 |   575 |   319   (8)| 00:00:01 |
|   2 |   HASH GROUP BY        |      |    23 |   575 |   319   (8)| 00:00:01 |
|*  3 |    HASH JOIN RIGHT SEMI|      |  2298 | 57450 |   317   (7)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL  | T3   |  2298 | 34470 |   158   (7)| 00:00:01 |
|   5 |     TABLE ACCESS FULL  | T2   | 57448 |   561K|   156   (6)| 00:00:01 |
-------------------------------------------------------------------------------

It’s possible that you’ll only notice a significant difference in performance (and, perhaps, PGA memory allocated) when you aggregate a large number of rows into a small result set. And it does seem that this is one case where the only way to get the plan you want without a significant code rewrite is through a hint

tl;dr

If you have some code which does a “group by X, Y, Z order by X, Y, Z” (the order of the columns/expressions used doesn’t have to be the same for the two clauses) then Oracle will use a sort group by operation to handle the aggregation and ordering in a single step, even though we can find cases where hash aggregation followed by sort ordering is more efficient.

If you come across such a case then injecting the hint /*+ use_hash_aggregation(@query_block_name) */ may be the only way to change the execution plan if you’re not allowed to edit the SQL.

Footnote

In the second of the two links to the Oracle Developer Forum you’ll see that one poster pointed out that if the order by clause uses a suitable expression to substitute for one of the columns in the group by clause then you don’t need to hint the code, e.g.

group by
        numeric_column
order by
        numeric_column + 0

Very cute, but not a good idea.

Footnote 2

There is another part to the testing I started in 2007, and I’ll come back to that later in Hash Aggregation – 2.

January 10, 2022

Control File Waits

Filed under: Infrastructure,LOBs,Oracle,Performance — Jonathan Lewis @ 2:10 pm GMT Jan 10,2022

Originally drafted 4th Aug 2011, this is a blog note that has been waiting a little over 10 years to be completed and published; and it’s probably worth doing it even now since the topic came up on the Oracle database forum just a couple of weeks ago.

The original title was CF Enqueues but I’ve combined it with notes from another draft called LOB write waits to get to the final title because a couple of small variations in my test script can produce all the annoying symptoms you may want to investigate in that area.

The forum note that triggered completion and publication was a complaint that’s common: “why is this so slow” but with a cause that comes up  less frequently: “BLOB operations and control file parallel writes”. The OP was running Oracle 19.9 but was suffering from a problem which (when I checked my archive and MOS) turned out to have a history (and a workaround) that started with 8.0.4. Their performance issue was due to their BLOB defintion which, with the storage information removed, was as follows:

LOB (CONTENT) STORE AS BASICFILE (
        ENABLE     STORAGE IN ROW
        CHUNK      32768
        PCTVERSION 10
        NOCACHE
        NOLOGGING
)

Note, particularly, the key words basicfile, nocache, and nologging. You should not be using basicfile LOBs in 19c, and both the nocache and nologging options can introduce problems of I/O and enqueue contention.

The impact of I/O and and enqueue contention depends, of course, on the nature of the work being done – you might be wasting effort or losing time but it might be a waste of resources that you’re not worred about and time that is effectively invisible to the end-user. For this user the task looked like an example of a “batch running as a pl/sql loop” repeatedly executing a statement similar to the following:


insert into table2(blob_column2)
select  blob_column1 
from    table1
where   id = :B1

A large fraction of the processing time was spent on “control file parallel write” (a consequence of nologging basicfile LOBs) and a further significant fraction was spent on “direct path write” (a consequence of nocache basicfile LOBs).

To understand the problem of the control file parallel write waits you need to know that every time a session makes a nologging change to a data file it also has to update an “unrecoverable SCN” field in the controlfile entry for that file. But if you’re inserting lots of individual LOBs then that turns into a lot of control file writes and the time spent on those writes, and the time spent waiting for other sessions to complete their writes and release the CF enqueue that might be protecting the control file from conflicting writes, can become significant.

As I said, this problem was first noted in 8.0.4, and an event was created to disable this control file update for LOBs (which were only of the basicfile variant at the time). The event was 10359 and the entry in $ORACLE_HOME/rdbms/mesg/oraus.msg for this event (which still has the same effect in 19.9) says:


10359, 00000, "turn off updates to control file for direct writes"
// *Cause:
// *Action:  Control files won't get updated for direct writes for  LOBs
//           when NOCACHE NOLOGGING is set. The only bad impact that it
//           can have is that if you are using the recovery manager,
//           it may affect a warning that says that the user should
//           back everything up. Now the recovery manager won't know
//           to tell you that the files that were updated with
//           unrecoverable events should be backed up.

Setting this event solved the OP’s performance problem.

It doesn’t really sound very nice, of course, setting an event that stops Oracle from telling you that “the files … should be backed up”, but the text does say the only bad impact is that it “may affect a warning” not that future attempts at recovery are going to fail. In fact there’s also MOS Doc ID: 1058851.6 Performance Degradation as a result of ‘enq: CF – contention (last updated March 2019 despite only having references to Oracle 8.0.4) which says (my emphasis):

If event 10359 is set to level 1, update of the control file with invalidation
redo is stopped.  This will not affect any recovery, but note that recovery
manager reports will be stale.

With event 10359 out of the way (but see footnote), I want to get back to my comment about switching to securefile LOBs. Here’s a very simple script I wrote a few years ago (and tweaked occasionally thereafter) to test the performance characteristics of LOBs:

rem
rem     Script: lob_write_waits.sql
rem     Author: Jonathan Lewis
rem     Dated:  Aug 2014
rem

drop table test_lobs;

@@setup

create table test_lobs (
        id              number(5),
        bytes           number(38),
        text_content    clob
)
lob (text_content)
store as 
--      securefile
        basicfile
        text_lob(
        disable storage in row
        nocache nologging
--
--      cache logging compress low                              --  0.26
--      cache reads logging compress low                        --  1.00
--      nocache logging compress low                            --  1.00
--
--      cache nologging compress low                            -- 14.00
--      cache reads nologging compress low                      --  1.00
--      nocache nologging compress low                          --  1.00
--
--      cache filesystem_like_logging compress low              -- 11.00
--      cache reads filesystem_like_logging compress low        --  1.00
--      nocache filesystem_like_logging compress low            --  1.00
)
;

insert into test_lobs values (1,1000, rpad('x',3000));
commit;

execute snap_my_stats.start_snap
execute snap_events.start_snap
execute snap_enqueues.start_snap

set serveroutput off
set feedback off

-- alter session set events '10359 trace name context forever, level 1';
-- alter session set db_unrecoverable_scn_tracking=false;
-- alter session set events '10046 trace name context forever, level 8';

declare
        c1 clob;
begin
        for r in 1..1000 loop
                insert into test_lobs values(r, 1000, lpad(r,1000,0));
                commit;
        end loop;
end;
/

commit;

alter session set events '10046 trace name context off';
alter session set events '10359 trace name context off';
alter session set db_unrecoverable_scn_tracking=true;

spool lob_write_waits

set serveroutput on
set feedback on

execute snap_enqueues.end_snap
execute snap_events.end_snap
execute snap_my_stats.end_snap

spool off

The script creates a table with a LOB column and inserts one row into it to avoid minor details of object creation and/or block formatting affecting the results I see from a loop test inserting 1000 rows one at a time with commit. You’ll notice that I’ve got lots of possible configurations for the LOB segment, including a choice between basicfile and securefile.

As it stands it will create a basicfile lob with the nocache nologging attributes (all the other options I have commented out are possible only for securefile LOBs thanks to the compress low clause.) Apart from the loop the script also allows a couple of parameters to be changed, or events to be set; and captures snapshots of sundry statistics.

If I allow the controlfile activity to take place, the following snippet shows the highlights of what my basicfile LOB does for me (or to me!) in 21c when the database is in ARCHIVELOG mode (which you shold be if you are trying to test the effects of any unrecoverable or nologging operation – the database is not in force_logging mode, by the way):

----------------------------------
System enqueues
Interval:-      6 seconds
----------------------------------
Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
CF         2,006           0       2,006           0           0 contention
XR         1,000           0       1,000           0           0 database force logging
TM         1,002           0       1,002           0           0 contention
TX         1,031           0       1,031           0           0 contention

---------------------------------------------------------
Session Events
Interval:-      6 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs            Csec    Avg Csec    Max Csec
-----                                             -----   ---------            ----    --------    --------
direct path sync                                  1,000           0           125.32        .125           6
control file sequential read                      5,014           0             5.50        .001           0
control file parallel write                       4,000           0           385.75        .096           6
direct path write                                 1,000           0             3.02        .003           5

---------------------------------
Session stats
Interval:-  6 seconds
---------------------------------
Name                                                                         Value
----                                                                         -----
total cf enq hold time                                                       3,953
total number of cf enq holders                                               1,000
physical writes                                                              1,000
physical writes direct                                                       1,000
physical writes direct (lob)                                                 1,000
redo entries                                                                 2,638
redo size                                                                1,647,276
redo size for direct writes                                                102,460
KTFB alloc space (block)                                                 9,437,184
ASSM gsp:get free block                                                      1,014
lob writes                                                                   1,000

As you can see, there’s a significant penalty due to the CF (controlfile) enqueues and control file reads and writes; and if I simply set event 10359 to level 1 (or set db_unrecoverable_scn_tracking to false) the most significant change is that the CF and control file I/O disappear and the job takes about 2 seconds instead of 6.

There’s nothing particularly interesting about the session activity stats, but since the activity is nologging it’s interesting to note that the redo size is nowhere near the 8MB we might expect (8KB blocks x 1,000 LOBs) even though we can see that Oracle has had to allocate 9MB of space for (mostly) the LOB data segment. Each nologging/unrecoverable LOB data block has averaged about 100 bytes of redo for tracking unrecoverable blocks. (When I changed the LOB chunk to 32K this KTFB allocation increased to 32MB – but that shouldn’t be a surprise).

When I switch to securefile LOBs (with the nocache nologging options) the results look like this:

----------------------------------
System enqueues
Interval:-      2 seconds
----------------------------------
Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
TM         1,001           0       1,001           0           0 contention
TX         2,073           0       2,073           0           0 contention
DW         1,057           0       1,057           0           0 contention

---------------------------------------------------------
Session Events
Interval:-      2 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs            Csec    Avg Csec    Max Csec
-----                                             -----   ---------            ----    --------    --------
direct path write                                 1,000           0            85.72        .086           4

---------------------------------
Session stats
Interval:-  2 seconds
---------------------------------
Name                                                                         Value
----                                                                         -----
redo entries                                                                 7,737
redo size                                                                3,594,280
KTFB alloc space (block)                                                 1,114,112
securefile allocation bytes                                              8,192,000
securefile allocation chunks                                                 1,000
securefile direct write bytes                                            8,192,000
securefile direct write ops                                                  1,000
securefile bytes non-transformed                                         2,000,000

If I use the compression low option at the same time the statistic about securefile bytes non-transformed drops to zero – not surprisingly – and I see 1,000 WG enqueues with reason “delete fso” and 3,000 WG enqueues with reason “lock fso”, but the timing is basically unchanged. The DW enqueue is the “In memory Dispenser” where the id1/id2 reference a tablespace and block address, and the WG enqueue is the “Write gather local enqueue” where the id1/id2 report the LOB ID (in two halves).

The most significant change, though, and the one that made me spend time on looking at results and traces etc. was what happened when I set the system to use “cache nologging” (or “cache filesystem_like_logging”) – the effect was really surprising; the report on enqueues didn’t show any significant differences (the same WG enqueues appeared if I enabled compression), but this is what happened with the session events:

---------------------------------------------------------
Session Events
Interval:-      12 seconds
---------------------------------------------------------
Event                                             Waits   Time_outs            Csec    Avg Csec    Max Csec
-----                                             -----   ---------            ----    --------    --------
write complete waits                              1,000           0         1,007.02       1.007           5

For every LOB write there was a write complete wait that averaged close to one centisecond! That’s a fairly suspect number, looking more like a time-out from a self-timer than a wait that’s terminated by some sort of signal of an action completed. After thinking about what might be happening and browsing through the session activity stats I noticed the following clue:

---------------------------------
Session stats
Interval:-  12 seconds
---------------------------------
Name                                                                         Value
----                                                                         -----
messages sent                                                                2,000

That looks like two messages sent for each LOB inserted – which I confirmed by doing a few more tests with different numbers of LOBs. So who is received those messges (and not replying to them)? I re-ran the test taking a snapshot of the session stats for every session in the instance and extracted (using grep) the figures for just the “messages sent” and “messages received”. Here are the results, rearranged slightly:

SID:   283: - oracle - oracle@linux21.localdomain (W003)
messages sent                                                              4
SID:   293: - oracle - oracle@linux21.localdomain (W006)
messages sent                                                              3
SID:   294: - oracle - oracle@linux21.localdomain (W005)
messages sent                                                              4
SID:    20: - oracle - oracle@linux21.localdomain (W000)
messages sent                                                              8
SID:    33: - oracle - oracle@linux21.localdomain (W002)
messages sent                                                              2
SID:    44: - oracle - oracle@linux21.localdomain (W007)
messages sent                                                              3

SID:    51:TEST_USER - jonathan - sqlplus@linux21.localdomain (TNS V1-V3)
messages sent                                                          2,004    (2,000)
SID:   263: - oracle - oracle@linux21.localdomain (DBW0)
messages sent                                                          1,989    (2,000)
messages received                                                      3,021    (3,000)
SID:    11: - oracle - oracle@linux21.localdomain (LGWR)
messages sent                                                          1,021    (1,000)
messages received                                                      2,019    (2,000)

If we ignore the messages sent by Wnnn (which, I believe, go to DBW0) and allow for a few timing/rounding errors, we might decide to work with the figures in brackets in the above. With a few extra tests, changing “commit;” to “commit write wait;” or commenting out the “commit;” my best interpretation (i.e. guess) of these numbers came down to:

  • The user session sends 1,000 messages to DBWR (to write the nologging LOB) and 1,000 messages to LGWR to write the “ordinary” table redo (and the number 1,000 also show up in the “redo synch writes” stat and  “log file sync” waits if we use “commit write wait”)
  • LGWR receives 1,000 messages from the session, and 1,000 messages from DBW0 (which would relate to “LOB management” with redo OP code 26.4 and 13.53) and sends 1,000 completion messages to DBW0.
  • DBW0 receives 1,000 messages from the session and 1,000 messages from LGWR, and sends 1,000 “log LOB Maintenance” messages to LGWR.

And if you’ve been counting carefully you’ll have noticed that there are still 1,000 message sent and received by DBW0 that I haven’t been able to account for. It’s possibly a “double-counting” error; or maybe there’s another process involved in a dialogue with DBW0 that doesn’t record its end of the dialogue, or maybe it’s DBW0 sending messages to itself(!). At some point I may remember how to switch on message tracing and check exactly which messages are sent from where to where, but for the moment I’m going to put a line under the investigation and just highlight the critical effect.

Whatever’s going on, a key detail from the experiments above is that the session does not seem to get any sort of message telling it that the unlogged LOB is safely on disc, it just goes to sleep for 1/100 second – presumably checking the buffer status when it wakes up to decide whether or not the cached LOB has been written.

Footnote

You may have come across the parameter db_unrecoverable_scn_tracking which appeared some time around 11.2.0.4 with the description: “Track nologging SCN in controlfile”. If this parameter is set to true (the default value) at instance startup then you can use event 10359 to switch the feature on and off at the session (and possibly system) level. More elegantly, though, you can set the parameter at the session level instead of using the numeric event to address the problem:

alter session set db_unrecoverable_scn_tracking=false;

I would advise against setting the value to false at instance startup, or even through a call to “alter system”; there’s a comment in Doc ID 12360160.8 – ALTER SYSTEM / SESSION of DB_UNRECOVERABLE_SCN_TRACKING that if the parameter is set to false at startup it can’t be changed dynamically thereafter.  (I haven’t tested if this last detail is still true.)

tl;dr

You really shouldn’t be using basicfile LOBs with newer versions of Oracle, you need to convert them to securefile LOBs.

If you do use basicfile LOBs then you may need to set event 10359 (or adjust the db_unrecoverable_scn_tracking parameter) if you want them to be nologging LOBs.

Even with securefile LOBs the options “cache nologging” and “cache filesystem_like_logging” introduce a surprising I/O wait time of 1/100 second per LOB, so those are two options to avoid.  All other securefile variations seemed to behave nicely (for me) but the amount of undo and redo can vary considerably with option so even with a 1/100 second time penalty you might still choose to use one of these options.

 

December 1, 2021

Best Practice

Filed under: Execution plans,Oracle,Performance — Jonathan Lewis @ 7:08 pm GMT Dec 1,2021

This showed up recently on the Oracle Developer Forum.

For a while I’m questioning myself about a best practice query for performance.

I’ve 2 tables: EMPLOYEES and DEPARTMENTS. The EMPLOYEES table can contains millions of records. Which query is better for performance ?

Select t1.emp_id, t1.emp_name, t2.dept_name
from EMPLOYEES t1, DEPARTMENTS t2
where t1.dept_id = t2.dept_id
order by t1.emp_name
Select emp_id, emp_name, (select mypackage.get_dept_name(dept_id) from dual) as dept_name
from EMPLOYEES
order by emp_name

The nice thing thing about this question is the correct use of English grammar; it uses “better” rather than “best” (the former is correct when comparing two things, the latter requires at least three things to be compared). On the down-side, any suggestion that a single pattern might be “best practice” for performance should be viewed with great caution; it’s far too easy for the less experience to translate “this is a good idea for this query” as “you should always do this”. At best it’s possible to identify patterns at the edges of style where you could say either: “this usually works well enough” or “you shouldn’t do this unless it’s really necessary”.

So I’m going to ignore the generic question and pretend that the question was supposed to be about the performance of the specific case; and then I’m going to complain that the question is badly posed because it omits several pieces of information that are likely to be relevant.

Some points to consider as background before looking too closely into performance:

  • We might expect that there is a referential integrity (RI) constraint between departments and employees. We might also expect to see emp.dept_id declared as not null – if it’s not then do we really want employees who are not yet assigned to a department to disappear from the report. Maybe the join should be an outer join.
  • How many departments are there. If there are millions of employees might there be thousands, or even tens of thousands, of departments? That’s going to make a difference to performance and (depending on strategy) to the consistency of the performance.
  • How important is it to squeeze the last bit of performance from this query. An Oracle database can perform perfectly well with millions of rows but how rapidly can the network transfer the rows from the database to the client, and how well is the client process going to deal with such a large result set. (Why does the client want to run a query that returns millions of rows – is the client then going to do something, after a long and tedious data transfer, that should have been done in 1/100th of the time in the database to supply a much smaller result set).
  • How competent are the people who may be required to support the query once it has gone into production. Should you follow the principle of preferring simplicity to optimum performance. What’s the difference in (database) performance between the better performance and the one that’s easier to understand?
  • Is there any reason why the query must not use parallel execution. Depending on the coding strategy adopted it’s possible that the slower serial query becomes the faster parallel query, and parallelism hasn’t even been mentioned.
  • Do you think that the version of Oracle may change the relative performance of different strategies for the query? (The correct answer is “yes”, and the query that uses a function call and scalar subquery is the one that’s more likely to be affected by future tweaks to the Oracle internal processing.)

So let’s look at specifics.

I’m going to ignore the option for using of a PL/SQL function to generate a department name from an id. Possibly the hope that using it would reduce the number of times a piece of SQL was executed against the dept table; but you shouldn’t use the public result cache for such a lightweight call, particularly when the session’s scalar subquery cache would probably get all the benefit that was available anyway.

Assume, then, that the referential integrity and not null declarations are in place, and that the number of departments is vaguely commensurate with the millions of employees. I’ve set up a little test with 50,000 rows in an “emp” table and 200 rows in a “dept” table just to have some concrete results to prompt a few thoughts.

Given the nature of the query (no predicates to limit data selected) the simple join with “order by” probably has only one sensible path:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |  50000 |00:00:00.03 |     953 |       |       |          |
|   1 |  SORT ORDER BY      |      |      1 |  50000 |  50000 |00:00:00.03 |     953 |  3738K|   834K| 3322K (0)|
|*  2 |   HASH JOIN         |      |      1 |  50000 |  50000 |00:00:00.01 |     953 |  1335K|  1335K| 1566K (0)|
|   3 |    TABLE ACCESS FULL| DEPT |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| EMP  |      1 |  50000 |  50000 |00:00:00.01 |     948 |       |       |          |
-----------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEPT"."ID"="EMP"."DEPT_ID")

I’ve run this with with rowsource execution stats enabled so that I can highlight details that might suggest ways to save resources like CPU or memory consumption. This gives rise to two thoughts:

First, would the subquery method do anything to reduce the volume of data sorted – i.e. would the subquery run late; secondly, even if it didn’t run late might it reduce the work needed to translate department ids into department names.

Here’s a version of the query using a scalar subquery with the resulting execution plan:

select 
        emp.id, emp.ename,
        (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |  50000 |00:00:00.05 |    5884 |       |       |          |
|   1 |  TABLE ACCESS BY INDEX ROWID| DEPT    |   4931 |      1 |   4931 |00:00:00.01 |    4935 |       |       |          |
|*  2 |   INDEX UNIQUE SCAN         | DEPT_PK |   4931 |      1 |   4931 |00:00:00.01 |       4 |       |       |          |
|   3 |  SORT ORDER BY              |         |      1 |  50000 |  50000 |00:00:00.05 |    5884 |  3738K|   834K| 3322K (0)|
|   4 |   TABLE ACCESS FULL         | EMP     |      1 |  50000 |  50000 |00:00:00.01 |     949 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEPT"."ID"=:B1)

The plan for any query with scalar subqueries in the select list reports the subqueries at the same level as the main query and before the query – so we can see here that the scalar subquery has been executed only 4,931 times, rather than the full 50,000, and that’s the benefit of scalar subquery caching. The execution count is in the thousands rather than being 200 (number of departments) because the cache is rather small and when hash values for department ids collide hashing only the first into the cache is retained (the cache doesn’t use linked lists to handle collisions).

There’s some ambiguity here, though. We can see that the Buffers statistic at operation 3 (the Sort) matches the total for the query, and consists of the value for the table scan plus the value for the subquery execution (5884 = 4935 + 949, but don’t ask me why the emp tablescan reports 949 buffers instead of the 948 previously reported) so we can’t easily tell whether Oracle executed the subquery before or after it had sorted the data from the tablescan. This could make a difference to performance since including the department name in the data to be sorted could mean using a lot more memory (and more I/O if the larger volume resulted in the sort spilling to disc). We can do a quick check with a simplified query – selecting only the department id from emp, rather than translating id to name.

select 
        emp.id, emp.ename,
        emp.dept_id
--      (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  50000 |00:00:00.02 |     948 |       |       |          |
|   1 |  SORT ORDER BY     |      |      1 |  50000 |  50000 |00:00:00.02 |     948 |  2746K|   746K| 2440K (0)|
|   2 |   TABLE ACCESS FULL| EMP  |      1 |  50000 |  50000 |00:00:00.01 |     948 |       |       |          |
----------------------------------------------------------------------------------------------------------------


With this simpler query the Used-Mem for the sort operation drops from 3322K to 2440K so Oracle must have been using the scalar subquery to fetch the department name before sorting. (I don’t know why Buffers for the tablescan has dropped to 948 again.)

So maybe it would be worth rewriting the query to sort the “narrower” data set before calling the scalar subquery:

select
        /*+ 
                qb_name(main) 
                no_merge(@inline)
                no_eliminate_oby(@inline)
        */
        v1.id, v1.ename, 
        (select /*+ qb_name(ssq) */ dept.dept_name from dept where dept.id = v1.dept_id) dept_name
from    (
        select 
                /*+ qb_name(inline) */
                emp.id, emp.ename, emp.dept_id
        from 
                emp
        order by
                emp.ename
        )       v1
/

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |  50000 |00:00:00.03 |     949 |       |       |          |
|   1 |  TABLE ACCESS BY INDEX ROWID| DEPT    |   4909 |      1 |   4909 |00:00:00.01 |    5112 |       |       |          |
|*  2 |   INDEX UNIQUE SCAN         | DEPT_PK |   4909 |      1 |   4909 |00:00:00.01 |     203 |       |       |          |
|   3 |  VIEW                       |         |      1 |  50000 |  50000 |00:00:00.03 |     949 |       |       |          |
|   4 |   SORT ORDER BY             |         |      1 |  50000 |  50000 |00:00:00.02 |     949 |  2746K|   746K| 2440K (0)|
|   5 |    TABLE ACCESS FULL        | EMP     |      1 |  50000 |  50000 |00:00:00.01 |     949 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEPT"."ID"=:B1)

As you can see, my execution plan now has a View operation telling us that the Sort operation completed inside a non-mergable view; and we can see that the Buffers and the Used-Mem statistics for the Sort operation has stayed at the 948(+1 again) buffers and 2440K of the simpler query. Of course, the need for the data (i.e. sorted rowsource) to pass up the plan through the view operation means a little extra CPU at that point before we start executing the subquery. By a lucky fluke the hash collisions on the scalar subquery cache have dropped slightly because the data department ids are arriving in a different order – this can be a good thing but, as you will have seen in the linked article above on filter subqueries, the change of order could have been a disaster. (More articles on scalar subquery caching – and deterministic functions – linked to from this URL)

You might also note the odd little glitch in the reporting – the final figure for Buffers has “lost” the result from the scalar subquery. (And this was on 19.11.0.0 – so it’s not an old error.)

So it seems we can do better than either of the options supplied in the original post. But the important question then is whether we should adopt that strategy or not. And the answer in many cases will be “probably not, unless it’s in very special circumstances and carefully documented”.

I’ve used a no_merge() hint and the undocumented no_eliminate_oby() hint; and that happens to have done what I wanted in the release of Oracle I was using; but in principle the optimizer could unnest the “@qb_name(ssq)” scalar subquery and produce a join between the “@qb_name(inline)” inline view and the unnested table – and that might cause the ordered data to become disordered (even though the sort order by operation would still be taking place). That’s a little unlikely, of course, but if it did happen someone would have to notice and deal with an urgent rewrite. (Strangely the reason why the optimizer doesn’t do this unnest in 19c and 21c is because – according to the 10053 trace file – the subquery “might return more than one row”, even though it’s selecting by equality on primary key!)

Summary Observations

Resist the temptation to ask for “best practices” for writing performant SQL. There are no immediately practical answers to the question (though there may be a couple of suggestions about how not to write your SQL).

You don’t necessarily want to put the more performant query into production.

Footnote

If you want to play around with this model, here’s the script I used to generate the data and run the test:

Click here to expand/contract the text
rem
rem     Script:         emp_dept_optimise.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sept 2021
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem

define m_dept=200

drop table emp;
drop table dept;

execute dbms_randoms.seed(0)

create table dept as
select 
        rownum id, 
        lpad(dbms_random.value,20) dept_name, 
        rpad('x',100) padding
from
        all_objects
where 
        rownum &amp;lt;= &amp;amp;m_dept
/

alter table dept add constraint dept_pk primary key(id);

create table emp as
select 
        rownum id, 
        1 + mod(rownum-1,&amp;amp;m_dept) dept_id,
        lpad(dbms_random.value,20) ename, 
        rpad('x',100) padding
from
        all_objects
where
        rownum &amp;lt;= 50000
/

alter table emp add constraint emp_pk primary key (id)
/

alter table emp add constraint emp_fk_dept foreign key(dept_id) references dept
/

set linesize 180
set arraysize 250

set termout off
set serveroutput off

alter session set statistics_level = all;

select 
        emp.id, emp.ename,
        emp.dept_id
--      (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

spool emp_dept_optimise.lst

prompt  ============================================
prompt  For a baseline on memory, query with dept_id
prompt  ============================================

select * from table(dbms_xplan.display_cursor(format=&amp;gt;'allstats last'));
spool off

select 
        emp.id, emp.ename,
        dept.dept_name
from 
        emp,
        dept
where
        dept.id = emp.dept_id
order by
         emp.ename
/

spool emp_dept_optimise.lst append

prompt  ======================================
prompt  For a simple join between emp and dept
prompt  ======================================

select * from table(dbms_xplan.display_cursor(format=&amp;gt;'allstats last projection'));
spool off

select 
        emp.id, emp.ename,
--      emp.dept_id
        (select dept.dept_name from dept where dept.id = emp.dept_id) dept_name
from 
        emp
order by
         emp.ename
/

spool emp_dept_optimise.lst append

prompt  ==============================================
prompt  Now switching to the dept_name scalar subquery
prompt  ==============================================

select * from table(dbms_xplan.display_cursor(format=&amp;gt;'allstats last'));
spool off



select
        /*+ 
                qb_name(main) 
                no_merge(@inline)
                no_eliminate_oby(@inline)
        */
        v1.id, v1.ename, 
        (select /*+ qb_name(ssq) */ dept.dept_name from dept where dept.id = v1.dept_id) dept_name
from    (
        select 
                /*+ qb_name(inline) */
                emp.id, emp.ename, emp.dept_id
        from 
                emp
        order by
                emp.ename
        )       v1
/

spool emp_dept_optimise.lst append

prompt  ==================================================
prompt  Now with the dept_name scalar subquery 'postponed'
prompt  ==================================================

select * from table(dbms_xplan.display_cursor(format=&amp;gt;'allstats last outline'));
spool off

set serveroutput on
alter session set statistics_level = typical;

August 6, 2021

Sequence Accelerator

Filed under: 19c,Infrastructure,Oracle,Performance — Jonathan Lewis @ 11:22 am BST Aug 6,2021

Update Feb 2022: There’s a threat warning (and avoidance mechanism) reported in comment 6 about side effects of this new feature, especially in a RAC system.

Connor McDonald has just published a blog note about a tweak to sequences that appeared in recent versions of Oracle (19.10 – see tweet from Timur Akhmadeev).

To address the problems caused by people leaving the sequence cache size at the default of 20 (leading to contention on very busy sequences – see footnote) Oracle’s internal code will now check the rate at which a sequence nextval is being called and “ignore” the cache definition, using larger and larger values to bump the sequence highwater in the updates to the seq$ table.

Connor pointed out that if you really wanted to see how big the jump might get you could crash your instance in the middle of a run, and see how large the gap in the sequence was at the next startup. But if you want to experiment a little further with the feature here’s a less painful way of doing it – enable SQL trace for just the sequence update statement – which in current versions has an SQL_ID of 4m7m0t6fjcs5x:

alter system  set events 'sql_trace[SQL:4m7m0t6fjcs5x] wait=false, bind=true';

-- wait a bit

alter system  set events 'sql_trace[SQL:4m7m0t6fjcs5x] off';

I’ve shown how to set the trace at the system level but it is possible to use the session level, and I’ve requested bind variables to be dumped on every execution of the statement. After you’ve got some trace files you can examine them to pick out the relevant values. (In a unix environment I’d use grep and awk to automate this).

Here’s a little script to create a table and sequence, enable tracing, then hammer the sequence. I’ve left everything to default so the sequence cache will be 20 and on older versions of Oracle we’d see the highwater mark of the sequence incremented by 20 on each update to seq$. I’m running 19.11.0.0

rem
rem     Script:         trace_seq_update.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Aug 2021
rem
rem     Last tested 
rem             19.11.0.0
rem

define m_sql_id = '4m7m0t6fjcs5x'

drop table t1;
drop sequence s1;

create table t1 (n1 number (10,0));
create sequence s1;

insert into t1 values(0);
commit;

spool trace_seq_update.lst

select  cache 
from    user_sequences 
where   sequence_name = 'S1'
;


alter session  set events 'sql_trace[SQL:&m_sql_id] wait=false, bind=true';

insert into t1 select s1.nextval 
from    all_objects
where   rownum <= 5000
/

alter session  set events 'sql_trace[SQL:&m_sql_id] off';

And here’s the first extract from the trace file:

PARSING IN CURSOR #140658757581416 len=129 dep=1 uid=0 oct=6 lid=0 tim=338002229669 hv=2635489469 ad='77638310' sqlid='4m7m0t6fjcs5x'
update seq$ set increment$=:2,minvalue=:3,maxvalue=:4,cycle#=:5,order$=:6,cache=:7,highwater=:8,audit$=:9,flags=:10 where obj#=:1
END OF STMT
BINDS #140658757581416:

To find the values used to update highwater from this point onwards I just kept searching for “BINDS #140658757581416:”,stepping down to “Bind#6”, and reporting the “value=” line that was 4 lines beyond that.

If you want to repeat the tests you’ll (probably) find that your cursor number (BINDS #nnnnnnnnnnnn) is difference. If you’ve done a system-wide trace, of course, you might have multiple sequences updated in the same trace file, in which case you’ll also need to report the value for “Bind#9” to separate the different sequences. Moreover, just to make automatic harder, you may find that the update cursor closes and re-opens with a new cursor number from time to time.

Here’s the complete list of Bind#6 entries for my test:

 Bind#6
  oacdty=02 mxl=22(02) mxlc=00 mal=00 scl=00 pre=00
  oacflg=10 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=76723ce7  bln=22  avl=02  flg=09
  value=21

 Bind#6
  oacdty=02 mxl=22(03) mxlc=00 mal=00 scl=00 pre=00
  oacflg=10 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=76723ce7  bln=22  avl=03  flg=09
  value=221

 Bind#6
  oacdty=02 mxl=22(03) mxlc=00 mal=00 scl=00 pre=00
  oacflg=10 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=76723ce7  bln=22  avl=03  flg=09
  value=2221

 Bind#6
  oacdty=02 mxl=22(04) mxlc=00 mal=00 scl=00 pre=00
  oacflg=10 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=76723ce7  bln=22  avl=04  flg=09
  value=22221

As you can see, the highwater jumps by 20, then 200, then 2,000 then 20,000. As a preliminary hypothesis it looks as if Oracle is going to take the cache size (BIND#5) as a base, but escalate that value internally by powers of 10 until the frequency of updates to seq$ drops to an acceptable value. (The value of cache – Bind#5 – isn’t changed by this mechanism, however).

Connor supplies a link to the 21c documentation on the feature – but it’s a bit thin, so there’s some scope for checking if RAC or scaled/expanded sequences produce any unexpected or puzzling side-effects.

Timur’s tweet also supplied a reference to MOS Doc ID: 2790985.1 Sequence dynamic cache resizing feature which has some helpful technical details.

Footnote

I’ve written a 4-part series on sequences for Simpletalk, starting at this URL.

July 9, 2021

19c tweak 2

Filed under: CBO,Oracle,Performance — Jonathan Lewis @ 4:40 pm BST Jul 9,2021

Trying to find out why a plan had changed in the upgrade from 11g to 19c I came across this cunning little tweak that must have appeared in the 19c timeline. I’ll start with a simple query, then the execution plans (autotrace traceonly) from 19.11.0.0 – first with the parameter optimizer_features_enable set to 18.1.0, then with the it set to 19.1.0. The table t1 is a copy of the first 10,000 rows of view all_objects:

SQL> alter session set optimizer_features_enable = '18.1.0';
SQL> select count(data_object_id) from t1 where f1(object_id) = 'Y';

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     7 |    38  (37)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     7 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |   100 |   700 |    38  (37)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("F1"("OBJECT_ID")='Y')



SQL> alter session set optimizer_features_enable = '19.1.0';
SQL> select count(data_object_id) from t1 where f1(object_id) = 'Y';

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     7 |    26   (8)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     7 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     5 |    35 |    26   (8)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DATA_OBJECT_ID" IS NOT NULL AND "F1"("OBJECT_ID")='Y')

Optimising with the optimizer features set back to 18.1 the cardinality estimate is 100 (that’s 1% of the rows in the table, the standard guess for “function() = constant”) with a cost of 38, of which 37% is CPU cost.

Running with the optimizer features of 19c enabled the cardinality estimate drops to 5 and the cost drops to 26 with CPU making up 5% of the cost. Where does the difference come from?

As ever you have to look at the Predicate Information. Running as 18c Oracle has decided to call my function for every row in the table; running as 19c Oracle has decided that since I’m counting non-null entries of column data_object_id it need only call the function when data_object_id is not null, so it’s introduced an extra predicate to make that happen, and that extra predicate has reduced the cardinality and cost estimates. (In my sample data set there are 9,456 nulls and 544 distinct values for data_object_id – so the difference in workload is significant. And 1% of 544 is 5, which explains the cardinality estimate.)

This looks like fix control 24761824 “add is not null for high null column in set function” introduced in 19.1.0. The description suggests that the feature will only be used in cases where the column is “often” null, but we have no clue, yet, about what “often” means. [Update 12th July: thanks to comment #1 below from Andi Schloegl we now have a pretty good idea that the break point is at 5%.]

This means that there may be cases where an execution plan changes on an upgrade to 19c because a tablescan has become cheaper or a cardinality estimate has been reduced.

Just as a confirmation of how the change in plan is echoing reality, here are the execution plans pulled from memory after executing them with the statistics_level set to all to enable collection of the rowsource execution statistics. First the 18c plan, then the 19c plan:

-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:11.14 |    1780K|
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:11.14 |    1780K|
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    100 |  10000 |00:00:11.14 |    1780K|
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("F1"("OBJECT_ID")='Y')



-------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.46 |   97010 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |      1 |00:00:00.46 |   97010 |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |      5 |    544 |00:00:00.46 |   97010 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("DATA_OBJECT_ID" IS NOT NULL AND "F1"("OBJECT_ID")='Y'))

As you can see, the buffer gets has dropped from 1,780K in 18c to 97K in 19c (mainly because the function results in a tablescan of a table of 178 blocks and the number of calls has dropped from 10,000 to 544), and the run time has dropped from 11.14 seconds to 0.46 seconds.

Code

If you want to run and refine this test, here’s the code I used to generate the data.

rem
rem     Script:         19c_not_null_tweak.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2021
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem

create table t1 as select * from all_objects where rownum <= 10000;
create table t2 as select * from t1;

create or replace function f1(i_obj in number) return varchar2
is
        n1 number;
begin
        select count(*) into n1 from t2 where object_id = i_obj;

        if n1 = 0 then
                return 'N';
        else
                return 'Y';
        end if;
end;
/

set autotrace traceonly explain

alter session set optimizer_features_enable = '18.1.0';
select count(data_object_id) from t1 where f1(object_id) = 'Y';

alter session set optimizer_features_enable = '19.1.0';
select count(data_object_id) from t1 where f1(object_id) = 'Y';

set autotrace off

set serveroutput off
alter session set statistics_level = all;

alter session set optimizer_features_enable = '18.1.0';
select count(data_object_id) from t1 where f1(object_id) = 'Y';
select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

alter session set optimizer_features_enable = '19.1.0';
select count(data_object_id) from t1 where f1(object_id) = 'Y';
select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

alter session set statistics_level = typical;
set serveroutput on

March 25, 2021

v$resource_limit

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 2:59 pm GMT Mar 25,2021

From time to time I see people on the public Oracle forums asking whether they should adjust one or other of the resource-related parameters – and it’s often the hidden or derived parameters that get targetted for this type of request. For example I came across a request fairly recently that said:

I’ve got a problem with the CF enqueue, I see lots of time spent on waits for this enqueue, should I increase the parameter _enqueue_locks?

Quick tip: if you have to ask a question like this the answer is almost certainly “no”. On the other hand if you can present a rational argument why an observation might point you to a parameter and an explanation of why you think the change might help there’s a simple check that you could do (for some parameters) before you ask the question. There’s a dynamic performance view that lists the utilitisation of a number of the special “resource” parameters and lets you see very easily whether you’re reaching the limit – it’s the view named in the title of this piece: v$resource_limit.

Here’s a simple script to report the contents of the view. It’s a script that has to be run by SYS (unless you care to grant suitable privileges to a non-SYS user), and in a container database it has to be executed in CDB$ROOT.

rem
rem     Script:         resource_limit.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2007
rem

set linesize 180
set pagesize 60
set trimspool on

column resource_name            format a32
column max_utilization          format 999,999
column current_utilization      format 999,999
column initial_allocation       format a18
column limit_value              format a11

spool resource_limit.lst

select
        resource_name,
        max_utilization,
        current_utilization,
        lpad(initial_allocation,18)     initial_allocation,
        lpad(limit_value,11)            limit_value
from
        v$resource_limit
;

spool off

The following is a sample output from a small 19.3 instances shortly after startup. In this case you can see that the RAC (ges) resources all show zero utilisation, and most of the others are fairly low.

RESOURCE_NAME                    MAX_UTILIZATION CURRENT_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
-------------------------------- --------------- ------------------- ------------------ -----------
processes                                     92                  85                360         360
sessions                                     109                  83                564         564
enqueue_locks                                 94                  39               6644        6644
enqueue_resources                             68                  31               2700   UNLIMITED
ges_procs                                      0                   0                  0           0
ges_ress                                       0                   0                  0   UNLIMITED
ges_locks                                      0                   0                  0   UNLIMITED
ges_cache_ress                                 0                   0                  0   UNLIMITED
ges_reg_msgs                                   0                   0                  0   UNLIMITED
ges_big_msgs                                   0                   0                  0   UNLIMITED
ges_rsv_msgs                                   0                   0                  0           0
gcs_resources                                  0                   0                  0   UNLIMITED
gcs_shadows                                    0                   0                  0   UNLIMITED
smartio_overhead_memory                  171,032                   0                  0   UNLIMITED
smartio_buffer_memory                          0                   0                  0   UNLIMITED
smartio_metadata_memory                        0                   0                  0   UNLIMITED
smartio_sessions                               1                   0                  0   UNLIMITED
dml_locks                                     29                   0               2480   UNLIMITED
temporary_table_locks                          5                   0          UNLIMITED   UNLIMITED
transactions                                   6                   0                620   UNLIMITED
branches                                       0                   0                620   UNLIMITED
cmtcallbk                                      3                   0                620   UNLIMITED
max_rollback_segments                         22                  22                620       65535
sort_segment_locks                             9                   3          UNLIMITED   UNLIMITED
k2q_locks                                      0                   0               1128   UNLIMITED
max_shared_servers                             2                   1          UNLIMITED   UNLIMITED
parallel_max_servers                          26                  26                 32       32767


I’ve listed the columns in an order that doesn’t match the view definition because the thing we’re likely to be interested in is the maximum utilisation we’ve seen since instance startup – so that’s the one I’ve reported first. After that I’ve reported the current utilisation, and then the initial allocation (the value that, in many cases, is set by one of the startup parameters), followed by the limiting value for the resource.

In some cases the initial allocation is the same as the limiting value – processes and sessions (mapping to v$process/x$ksupr and v$session/x$ksuse) are good examples of this – these are fixed arrays defined when the instance starts – but in some cases the initial allocation is only a “reasonable” starting guess which allows Oracle to extend on demand, often through segmented arrays of 16 entries at a time, and in some cases there is no final limit to what resources you’re allowed (until the instance crashes with an ORA-04030 error, of course).

So, for example, to answer the question posed at the start of this note – should you increase the hidden parameter _enqueue_locks ? Certainly not for this instance because we can see the initial allocation is 6,644 and we’ve only reached a maximum of 94 enqueue locks allocated simultaneously.

AWR / Statspack

As with most of the dynamic performance views, there’s a summary report of v$resource_limit in the AWR (or Statspack) reports. The type of output you get is as follows;

Resource Limit Stats                            DB/Inst: OR19/or19  Snap: 4576
-> Only rows with Current or Maximum Utilization > 80% of Limit are shown
-> For "UNLIMITED" resources, rows whose Current or Maximum Utilization
   exceeds 2*Initial Allocation are shown
-> Ordered by Resource Name

                                  Current      Maximum     Initial
Resource Name                   Utilization  Utilization Allocation   Limit
------------------------------ ------------ ------------ ---------- ----------
smartio_overhead_memory                   0      171,032          0  UNLIMITED
smartio_sessions                          0            1          0  UNLIMITED
                          ------------------------------------------------------

As you can see the output is strictly limited by an 80% “stress” condition – but there’s more data available if you query the dba_hist_resource_limit view (or wrh$_resource_limit table, or Statspack equivalents) directly. For example:

select
        resource_name,
        max_utilization,
        current_utilization,
        lpad(initial_allocation,18)     initial_allocation,
        lpad(limit_value,11)            limit_value
from
        wrh$_resource_limit
where
        snap_id         =  4849
and     instance_number =  1
and     dbid            =  3091945231
and     max_utilization != 0
;


RESOURCE_NAME                    MAX_UTILIZATION CURRENT_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
-------------------------------- --------------- ------------------- ------------------ -----------
cmtcallbk                                      2                   0                620   UNLIMITED
dml_locks                                     28                  19               2480   UNLIMITED
enqueue_locks                                 68                  41               6644        6644
enqueue_resources                             62                  55               2700   UNLIMITED
max_rollback_segments                         22                  22                620       65535
parallel_max_servers                          12                   8                 32       32767
processes                                     99                  69                360         360
sessions                                     123                  87                564         564
transactions                                   4                   3                620   UNLIMITED

I’ve limited the history query to rows with non-zero utilization, which is why it’s showing so few rows, but It’s an interesting oddity that the dba_hist_resource_limit view (and the underlying table) does actually hold fewer statistics than the initial dynamic performance view. In a 12.1.0.2 (RAC) instance I found that 3 of the dynamic statistics had not been captured in the history, in 19.3 this had gone up to 5 missing statistics, namely:

Not captured in 12.1.0.2 (RAC) history 

max_shared_servers
sort_segment_locks
temporary_table_locks

Further items not captured in 19.3.0.0 
ges_procs
ges_rsv_msgs

It’s possible, of course, that the two extra stats from 19.3 would be captured in a RAC system.

Finally you might want to run a query through history if you needed to see if there was a pattern to some unexpected change in resource utilisation, so a query running through time might be useful, e.g.:

select  * 
from    (
        select 
                ss.end_interval_time,
                res.resource_name, res.max_utilization, res.current_utilization 
        from 
                dba_hist_resource_limit res,
                dba_hist_snapshot       ss
        where 
                ss.end_interval_time between to_date('18-Mar-2021 10:00','dd-mon-yyyy hh24:mi')
                                and     to_date('18-Mar-2021 22:30','dd-mon-yyyy hh24:mi')
        and     res.snap_id = ss.snap_id
        and     res.resource_name in ('sessions','processes','transactions')
        )       piv
        pivot   (
                        avg(max_utilization)     as max,
                        avg(current_utilization) as cur
                for     resource_name in (
                                'sessions'      as sess,
                                'processes'     as proc,
                                'transactions'  as trns
                        )
                )
order by
        end_interval_time
/


END_INTERVAL_TIME          SESS_MAX   SESS_CUR   PROC_MAX   PROC_CUR   TRNS_MAX   TRNS_CUR
------------------------ ---------- ---------- ---------- ---------- ---------- ----------
18-MAR-21 10.00.46.698          123         84         99         66          4          3
18-MAR-21 11.00.59.610          123         84         99         67          4          2
18-MAR-21 12.00.13.015          123         82         99         66          4          2
18-MAR-21 13.00.27.443          123         82         99         66          4          1
18-MAR-21 14.00.40.316          123         84         99         66          4          2
18-MAR-21 15.00.51.705          123         80         99         64          4          1
18-MAR-21 16.00.57.293          123         84         99         66          4          1
18-MAR-21 17.00.03.197          123         80         99         65          4          2
18-MAR-21 18.00.09.448          123         81         99         65          4          2
18-MAR-21 19.00.16.419          123         82         99         66          4          2
18-MAR-21 20.00.22.669          123         81         99         65          4          2
18-MAR-21 21.00.31.215          123         83         99         66          4          1
18-MAR-21 22.00.43.615          123         86         99         68          4          2

In my particular case there’s absolutely nothing interesting to see, but the sort of thing you might spot is a steady growth in the maximum number of sessions over a couple of hours one day, then a subsequent repeated decrease and increase (to that max) in the current number of sessions from then on. It’s always a little difficult when you have a statistic that is “maximum since startup” so you would have to be a little careful in interpreting the results of a query like this.

 

March 12, 2021

Distributed Sequences

Filed under: distributed,Execution plans,Oracle,Performance,Problem Solving — Jonathan Lewis @ 9:09 am GMT Mar 12,2021

A request for help came up some time ago on ODC reporting a query that was hanging when it included a sequence.nextval. In fact the intial “query” was an “insert as select” with a select that was a join of two remote tables.

Making the fairly automatic assumption that many people say “hanging” when they really mean “hasn’t finished yet” the first thought I had about the structure of the statement was that it was just a variation of the standard problem of distributed DML. I haven’t written anything previously about how using sequences can introduce the problem so here’s a note to demonstrate the issue and suggest a workaround:

We start with a database link and a few tables:


rem
rem     Script:         distributed_sequence.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             18.3.0.0
rem             12.2.0.1
rem

rem     create public database link orcl@loopback using 'orcl';

define m_target=orcl@loopback

create sequence s1 cache 10000;
select s1.nextval from dual;

create table t1
segment creation immediate
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          n1,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1
;

create table t2
nologging
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          n2,
        lpad(rownum,10,'0')             v2,
        lpad('x',100,'x')               padding
from
        generator       v1
;

create table t3(
        id1     number(6,0),
        id2     number(6,0),
        n0      number(6,0),
        n1      number(6,0),
        n2      number(6,0),
        v1      varchar2(10),
        v2      varchar2(10)
)
segment creation immediate
;

create or replace view v1 as
select
        t1.id id1,
        t2.id id2,
        t1.n1,
        t2.n2,
        t1.v1,
        t2.v2
from
        t1, t2
where
        t2.id = t1.id
;

The command to create a public database link (one example of the many optional commands in my original source) has to be run by a suitably privileged schema as a one-off event. The definition of the substitution variable m_target that I’ll be using as my database link (again with many possibilities in my original script) means I only have to edit my choice of database link once in my script as I change Oracle instances. You’ll notice I’ve done my usual trick of using a loopback database link to emulate a distributed system.

I’ve then created two populated tables (t1, t2) and a third empty table (t3) that will be the target of an insert. I’ve also created a view (v1) that joins the first two tables and a sequence (s1) that I’ve primed with a single call to nextval. When I get to the tests I’ll be using t1, t2 and v1 as if they had been created in the remote database (referenced through my loopback database link) while t3 and s1 will be local objects.

So let’s run a couple of statements and see what their execution plans look like:


set serveroutput off

prompt  =======================
prompt  Insert without sequence
prompt  =======================

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        t1.id,
        t2.id,
        0,
        t1.n1,
        t2.n2,
        t1.v1,
        t2.v2
from
        t1@&m_target    t1,
        t2@&m_target    t2
where
        t2.id = t1.id
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));

prompt  =======================
prompt  Insert WITH sequence
prompt  =======================

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        t1.id,
        t2.id,
        s1.nextval,
        t1.n1,
        t2.n2,
        t1.v1,
        t2.v2
from
        t1@&m_target    t1,
        t2@&m_target    t2
where
        t2.id = t1.id
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));
commit;

The code joins t1 and t2, selects a couple of columns and inserts then into t3 without or with a call to s1.nextval (the local sequence). Here’s the output (with minor cosmetic changes) from executing this code under 19.3.0.0 (the effects are the same in 18.3.0.0 and 12.2.0.1):


=======================
Insert without sequence
=======================

10000 rows created.


SQL_ID  373nz810u3frv, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  t1.id,  t2.id,
0,  t1.n1,  t2.n2,  t1.v1,  t2.v2 from  t1@orclpdb@loopback t1,
t2@orclpdb@loopback t2 where  t2.id = t1.id

-----------------------------------------------------------------
| Id  | Operation                | Name | Cost  | Inst   |IN-OUT|
-----------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |     1 |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |        |      |
|   2 |   REMOTE                 |      |       | ORCLP~ | R->S |
-----------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
   2 - SELECT "A2"."ID","A1"."ID",0,"A2"."N1","A1"."N2","A2"."V1","A1"."
       V2" FROM "T1" "A2","T2" "A1" WHERE "A1"."ID"="A2"."ID" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

Note
-----
   - cpu costing is off (consider enabling it)


=======================
Insert WITH sequence
=======================

10000 rows created.


SQL_ID  8jg23arujnh01, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  t1.id,  t2.id,
s1.nextval,  t1.n1,  t2.n2,  t1.v1,  t2.v2 from  t1@orclpdb@loopback
t1,  t2@orclpdb@loopback t2 where  t2.id = t1.id

-------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |       |       |    54 (100)|          |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |       |            |          |        |      |
|   2 |   SEQUENCE               | S1   |       |       |            |          |        |      |
|*  3 |    HASH JOIN             |      | 10000 |   937K|    54   (8)| 00:00:01 |        |      |
|   4 |     REMOTE               | T1   | 10000 |   468K|    26   (4)| 00:00:01 | ORCLP~ | R->S |
|   5 |     REMOTE               | T2   | 10000 |   468K|    26   (4)| 00:00:01 | ORCLP~ | R->S |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"="T1"."ID")

Remote SQL Information (identified by operation id):
----------------------------------------------------
   4 - SELECT /*+ OPAQUE_TRANSFORM */ "ID","N1","V1" FROM "T1" "A2" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

   5 - SELECT /*+ OPAQUE_TRANSFORM */ "ID","N2","V2" FROM "T2" "A1" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

The key thing to notice is that when we want to insert the local sequence as a column in the select list Oracle breaks the hash join into two separate accesses to the remote database and pulls all the data we might need from the two tables before trying to join them locally. When the query is “fully remote” the local database can let the remote database deal with the join, when the query is distributed – which is a side effect of introducing the sequence – the local site becomes the driving site and has to work out the least worst way of handling the join, which might be much slower than the remote join.  (It’s an odd little quirk that when the select is fully remote the optimizer thinks that it’s not using CPU costing. Possibly that’s because all the arithmetic happens somewhere else and the local cost of the query never gets above zero.)

This is one of those cases where we might work around the problem by creating a remote view to handle the join – hence the creation of the view v1; here’s a suitable statement and the resulting execution plan:

prompt  ==============================
prompt  Insert using view and sequence
prompt  ==============================

set serveroutput off

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        v1.id1,
        v1.id2,
        s1.nextval,
        v1.n1,
        v1.n2,
        v1.v1,
        v1.v2
from
        v1@&m_target v1
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));
commit;



==============================
Insert using view and sequence
==============================

10000 rows created.


SQL_ID  4tz0rrqt87nb8, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  v1.id1,  v1.id2,
s1.nextval,  v1.n1,  v1.n2,  v1.v1,  v1.v2 from  v1@orclpdb@loopback v1

-------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |       |       |    27 (100)|          |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |       |            |          |        |      |
|   2 |   SEQUENCE               | S1   |       |       |            |          |        |      |
|   3 |    REMOTE                | V1   | 10000 |   937K|    27   (8)| 00:00:01 | ORCLP~ | R->S |
-------------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
   3 - SELECT /*+ OPAQUE_TRANSFORM */ "ID1","ID2","N1","N2","V1","V2" FROM "V1" "V1"
       (accessing 'ORCLPDB.LOCALDOMAIN@LOOPBACK' )

As you can see, the local optimizer doesn’t know enough about the remote view to be able to split it into components and make a mess of the execution plan, it simply sees a query against a “single table” and leaves the remote database to worry about optimising it. But, of course, we don’t always have the luxury of being able to create objects on someone else’s database, so what’s the alternative?

Try rewriting the query to use an inline view with the /*+ no_merge() */ hint:


prompt  =======================================
prompt  Insert from a no-merge inline view with
prompt  a sequence.nextval in the outer query
prompt  ========================================

set serveroutput off

insert into t3 (id1, id2, n0, n1, n2, v1, v2)
select
        id1, id2, s1.nextval, n1, n2, v1, v2
from    (
        select  /*+ no_merge */
                t1.id   id1,    
                t2.id   id2,
                0       n0,
                t1.n1,
                t2.n2,
                t1.v1,
                t2.v2
        from
                t1@&m_target    t1,
                t2@&m_target    t2
        where
                t2.id = t1.id
        )       v1
;

select * from table(dbms_xplan.display_cursor(format=>'-plan_hash'));
commit;


=======================================
Insert from a no-merge inline view with
a sequence.nextval in the outer query
========================================

10000 rows created.


SQL_ID  20z81g550tbsk, child number 0
-------------------------------------
insert into t3 (id1, id2, n0, n1, n2, v1, v2) select  id1, id2,
s1.nextval, n1, n2, v1, v2 from (  select /*+ no_merge */   t1.id id1,
 t2.id id2,   0 n0,   t1.n1,   t2.n2,   t1.v1,   t2.v2  from
t1@orclpdb@loopback t1,   t2@orclpdb@loopback t2  where   t2.id = t1.id
 ) v1

-------------------------------------------------------------------------------------------------
| Id  | Operation                | Name | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |      |       |       |    54 (100)|          |        |      |
|   1 |  LOAD TABLE CONVENTIONAL | T3   |       |       |            |          |        |      |
|   2 |   SEQUENCE               | S1   |       |       |            |          |        |      |
|   3 |    VIEW                  |      | 10000 |   937K|    54   (8)| 00:00:01 |        |      |
|   4 |     REMOTE               |      |       |       |            |          | ORCLP~ | R->S |
-------------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
   4 - SELECT /*+ NO_MERGE */ "A2"."ID","A1"."ID",0,"A2"."N1","A1"."N2","A2"."V1","A1"."V
       2" FROM "T1" "A2","T2" "A1" WHERE "A1"."ID"="A2"."ID" (accessing
       'ORCLPDB.LOCALDOMAIN@LOOPBACK' )


The hint has done exactly what we needed: it has passed the text of the inline view to the remote database for optimisation so the join happens remotely, and the sequence number is then included after the result set comes back from the remote database. The SQL is a little messier, of course, mostly thanks to the doubled appearance of the columns in the select list.

Summary

Using a local sequence in DML that accesses a remote database makes the optimizer treats the underlying query as a distributed query, and this may mean it can’t find an efficient execution path unless you do some re-engineering of the code. If you can manage to make an insert with a constant efficient then using that version of the code as an in-line no_merge view with one extra layer that brings the sequence into play may be all you need to do to make the DML operate efficiently.

Next Page »

Website Powered by WordPress.com.