Tuning is sometimes all about the data......



Now this week I've really had to think while doing some SQL tuning. My usual trick of just looking at some SQL and saying 'add this index', or doing some other clever stuff, just wasn't cutting it - I had to actually start understanding the data and what was actually going on to cause the 'slowness'.

Now this database has form for troublesome queries: it's already huge (tens of TB), makes heavy use of parallel query, and is on 11.2 (which is actually quite old now - surprising how fast that happens). So I knew this could well be hard to solve.

It turns out that there was an original single query written to perform this bit of business process - but that SQL never finished, so the decision was made to subdivide the dataset to make it more manageable and allow the job to complete in a reasonable time.

This was the query I was presented with as the problem (the one that was fixing the other problem....):

The basic SQL was this:

INSERT
        /*+ APPEND */
INTO "SB_DWH_IN"."IN_NON_LIN_PRE_POS_HOURLY_CURR"
        (
                "TRADE_SID"                  ,
                "SRC_SYS_CODE"               ,
                "SRC_SYS_TRADE_CODE"         ,
                "SRC_SYS_TRADE_VERSION_CODE" ,
                "POS_START_DATE"             ,
                "POS_END_DATE"               ,
                "VOLUME_UNIT_CODE"           ,
                "VOLUME"                     ,
                "PRICE"                      ,
                "INSERT_PROCESS_ID"          ,
                "INSERT_TS"
        )
SELECT "IN_NON_LINEAR_SPLIT_TRADES"."TRADE_SID" TRADE_SID                                 ,
        "IN_NON_LINEAR_POS_SPLIT"."SRC_SYS_CODE" SRC_SYS_CODE                             ,
        "IN_NON_LINEAR_POS_SPLIT"."SRC_SYS_TRADE_CODE" SRC_SYS_TRADE_CODE                 ,
        "IN_NON_LINEAR_POS_SPLIT"."SRC_SYS_TRADE_VERSION_CODE" SRC_SYS_TRADE_VERSION_CODE ,
        "META_TEMPORAL_HOURS"."START_TS" POS_START_DATE                                   ,
        "META_TEMPORAL_HOURS"."END_TS" POS_END_DATE                                       ,
        "IN_NON_LINEAR_POS_SPLIT"."VOLUME_UNIT_CODE" VOLUME_UNIT_CODE                     ,
        "IN_NON_LINEAR_POS_SPLIT"."VOLUME" VOLUME                                         ,
        "IN_NON_LINEAR_POS_SPLIT"."PRICE" PRICE                                           ,
        77308 INSERT_PROCESS_ID                                                           ,
        sysdate INSERT_TS
FROM ("SB_DWH_IN"."IN_NON_LINEAR_SPLIT_TRADES" "IN_NON_LINEAR_SPLIT_TRADES"
        INNER JOIN "SB_DWH_IN"."IN_NON_LINEAR_POS_SPLIT" "IN_NON_LINEAR_POS_SPLIT"
        ON
                (
                        "IN_NON_LINEAR_SPLIT_TRADES"."TRADE_SID" = "IN_NON_LINEAR_POS_SPLIT"."TRADE_SID"
                )
        )
        INNER JOIN "SB_DWH_META"."META_TEMPORAL_HOURS" "META_TEMPORAL_HOURS"
        ON
                (
                        "META_TEMPORAL_HOURS"."GRANULARITY" = "IN_NON_LINEAR_POS_SPLIT"."SPLIT_GRANULARITY"
                )
                AND
                (
                        "META_TEMPORAL_HOURS"."START_TS" >= "IN_NON_LINEAR_POS_SPLIT"."POS_START_DATE"
                )
                AND
                (
                        "META_TEMPORAL_HOURS"."START_TS" < "IN_NON_LINEAR_POS_SPLIT"."POS_END_DATE"
                )
WHERE
        (
                "IN_NON_LINEAR_SPLIT_TRADES"."GENKEY" > 33700
        )
        AND
        (
                "IN_NON_LINEAR_SPLIT_TRADES"."GENKEY" <= 33800
        )

At first glance this looks like a medium complexity query and when it runs we see an execution plan like below


Which honestly does look reasonable at first glance (the parallelism, by the way, comes from the session running 'alter session force parallel query parallel 8'). The actual numbers are slightly different here for this test case - the real tables were much larger - but the plan is exactly the same.

Watching the statement run through SQL Monitoring reveals that almost all the time is spent in the hash join step burning CPU - this was the part that needed solving. Why was it so bad?
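For reference, the SQL Monitoring report described here can be pulled with DBMS_SQLTUNE (the sql_id below is a placeholder - find the real one in v$sql_monitor):

```sql
-- Generate a SQL Monitoring report for a running (or recently run) statement.
-- The sql_id is a placeholder, not from this case.
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id       => 'abcd1234efgh5',
         type         => 'TEXT',
         report_level => 'ALL') AS report
FROM   dual;
```
The 'Activity %' column in the report is what points the finger at the hash join step here.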

I messed around trying various things at this point with limited success - before I started to dig a little deeper into the data. I discovered that the join between the very large table (200M rows in the real case) and the medium table (5M rows) was done on three columns: one equality condition, plus a date range condition.

Now this is where it got interesting - the equality condition was on a very non-selective column (about 25% of the rows in the 5M row table had that value), while the date range was very selective.
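This sort of skew is quick to check for yourself - a minimal sketch against the column in question here (the table and column names are from the query above; your counts will obviously differ):

```sql
-- How selective is the equality column really? A handful of values each
-- covering a large fraction of the table is the warning sign described above.
SELECT granularity,
       COUNT(*) AS rows_per_value
FROM   sb_dwh_meta.meta_temporal_hours
GROUP  BY granularity
ORDER  BY rows_per_value DESC;
```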

Looking again at the execution plan reveals what is going on


Here I have marked the relevant section - the hash join step contains two parts: an access step (only pulling back rows that match the access criteria - labelled 'A'), followed by a filter step (filtering the data in memory against the remaining criteria - labelled 'F').

In this case we were bringing over a million rows into the hash join only to filter out all but the two or three that match the date range - this is completely backwards: we want to access on the selective date range and then filter on the equality condition.
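The access/filter split is visible in the 'Predicate Information' section that DBMS_XPLAN prints under the plan - worth checking whenever a join step burns CPU like this (sql_id again a placeholder):

```sql
-- The plan output ends with a "Predicate Information" section showing
-- which conditions are access() predicates (applied while probing) and
-- which are filter() predicates (applied to rows already fetched).
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR('abcd1234efgh5', NULL, 'TYPICAL'));
```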

OK - so we know the issue; now how to fix it?

Doing some reading, I came across a note on MOS (which I'm still not sure is correct or not) stating that a hash join cannot use range predicates as access predicates. This perhaps explains why we don't seem to be able to get the access/filter steps the right way round.

So where does that leave us - is this solvable?

Well, yes it is - with hints. Specifically, in this case, force a nested loop join from the big table to the small one via an index that has the date column first.
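The index itself isn't shown in the post; a hypothetical definition (index name and trailing column are my assumptions - the key point from the text is that the date column leads) would look something like:

```sql
-- Hypothetical: name and exact column list assumed. START_TS first lets
-- the selective range predicate drive the index access on each NL probe,
-- with the non-selective GRANULARITY as a trailing filter column.
CREATE INDEX sb_dwh_meta.ix_mth_startts
  ON sb_dwh_meta.meta_temporal_hours (start_ts, granularity);
```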

So without reposting all of the code - the hint I added is this:

SELECT /*+ LEADING("IN_NON_LINEAR_POS_SPLIT") USE_NL("IN_NON_LINEAR_POS_SPLIT" "META_TEMPORAL_HOURS") */

This tells the optimizer to start with the big table and then do a nested loop join to the medium table.

Now this fixed the query (or at least made it run in a reasonable time), but having 'solved' the issue I decided we could probably get the whole splitting of the data into multiple batches scrapped - we could probably do the whole thing in a single statement and make things simpler and faster.

So I rewrote the code to do that - this simplified it somewhat and the overall time was much better - however, by this time I was on a roll and decided to go further......

So I ended up doing the following:

1) get the index on the destination table dropped
2) enable parallel dml (as well as parallel query) - an easily forgotten step
3) partition the final destination table.....
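The partitioning DDL from step 3 isn't shown in the post; a hypothetical sketch of such a destination table (interval partitioning on the position start date - the partition key, interval, boundary date and column types are all my assumptions):

```sql
-- Hypothetical DDL: the partitioning scheme and all datatypes are assumed,
-- not taken from the real system. Interval partitioning (11g+) creates
-- monthly partitions automatically as data arrives.
CREATE TABLE sb_dwh_in.demo (
  trade_sid                  NUMBER,
  src_sys_code               VARCHAR2(30),
  src_sys_trade_code         VARCHAR2(30),
  src_sys_trade_version_code VARCHAR2(30),
  pos_start_date             DATE,
  pos_end_date               DATE,
  volume_unit_code           VARCHAR2(30),
  volume                     NUMBER,
  price                      NUMBER,
  insert_process_id          NUMBER,
  insert_ts                  DATE
)
PARTITION BY RANGE (pos_start_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_initial VALUES LESS THAN (DATE '2014-01-01'));
```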

So the SQL steps ended up looking like this:

alter session force parallel query parallel 16;
alter session enable parallel dml;
alter session force parallel dml parallel 16;

INSERT
        /*+ APPEND */
INTO "SB_DWH_IN"."DEMO"
        (
                "TRADE_SID"                  ,
                "SRC_SYS_CODE"               ,
                "SRC_SYS_TRADE_CODE"         ,
                "SRC_SYS_TRADE_VERSION_CODE" ,
                "POS_START_DATE"             ,
                "POS_END_DATE"               ,
                "VOLUME_UNIT_CODE"           ,
                "VOLUME"                     ,
                "PRICE"                      ,
                "INSERT_PROCESS_ID"          ,
                "INSERT_TS"
        )
SELECT /*+ LEADING("IN_NON_LINEAR_POS_SPLIT") USE_NL("IN_NON_LINEAR_POS_SPLIT" "META_TEMPORAL_HOURS") */
        "IN_NON_LINEAR_POS_SPLIT"."TRADE_SID" TRADE_SID                                 ,
        "IN_NON_LINEAR_POS_SPLIT"."SRC_SYS_CODE" SRC_SYS_CODE                           ,
        "IN_NON_LINEAR_POS_SPLIT"."SRC_SYS_TRADE_CODE" SRC_SYS_TRADE_CODE               ,
        "IN_NON_LINEAR_POS_SPLIT"."SRC_SYS_TRADE_VERSION_CODE" SRC_SYS_TRADE_VERSION_CODE ,
        "META_TEMPORAL_HOURS"."START_TS" POS_START_DATE                                 ,
        "META_TEMPORAL_HOURS"."END_TS" POS_END_DATE                                     ,
        "IN_NON_LINEAR_POS_SPLIT"."VOLUME_UNIT_CODE" VOLUME_UNIT_CODE                   ,
        "IN_NON_LINEAR_POS_SPLIT"."VOLUME" VOLUME                                       ,
        "IN_NON_LINEAR_POS_SPLIT"."PRICE" PRICE                                         ,
        99999 INSERT_PROCESS_ID                                                         ,
        sysdate INSERT_TS
FROM "SB_DWH_IN"."IN_NON_LINEAR_POS_SPLIT" "IN_NON_LINEAR_POS_SPLIT"
        INNER JOIN "SB_DWH_META"."META_TEMPORAL_HOURS" "META_TEMPORAL_HOURS"
        ON
                (
                        "META_TEMPORAL_HOURS"."START_TS" >= "IN_NON_LINEAR_POS_SPLIT"."POS_START_DATE"
                )
                AND
                (
                        "META_TEMPORAL_HOURS"."START_TS" < "IN_NON_LINEAR_POS_SPLIT"."POS_END_DATE"
                )
                AND
                (
                        "META_TEMPORAL_HOURS"."GRANULARITY" = "IN_NON_LINEAR_POS_SPLIT"."SPLIT_GRANULARITY"
                );

This then creates a plan looking like this



The result of those statements is this:

192739978 rows created.

Elapsed: 00:03:50.07

Pretty good, I thought (even if I do say so myself). I was in danger of developing CTD (compulsive tuning disorder) by the end of this exercise - but I decided that less than 4 minutes to join a huge dataset and insert ~200M rows was pretty good.......

Some comments on this, though, as there are several useful points here.

1. Sometimes the data is really important - you need to understand it and its relationships to other tables - the optimizer can't get everything right!
2. Indexes really do have a large impact on huge data inserts (up to 50% of the time in my testing for this example) - this is almost ignored for OLTP, but for DW it is a huge factor.
3. Parallel query/DML can do amazing things - but be careful, it can also do terrible things - you have to understand what you are doing.
4. Sometimes simplifying a query can fix a lot of the issues - but again, you sometimes have to understand the data to see that the query can be simplified.
5. Parallel DML can help hugely and it's not widely used - it has to be explicitly enabled (alter session enable parallel dml). It's also not that obvious whether it is being used or not - in the SQL Monitoring screens above, the final statement is using it. This can be seen because the 'LOAD AS SELECT' step is below the PX step, meaning each individual slave is inserting - rather than the original example, where all the rows are sent to the coordinator, which inserts them itself. This can be another huge win to enable.
6. Sometimes you have to hint (or kick the optimizer some other way) to get the desired result. In the example above, the cost of the very fast plan is way higher than that of the slow one - again, the optimizer is not perfect. It can't find the optimal solution all the time.
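On point 5: whether your session can actually use parallel DML is easy to check directly, rather than inferring it from the plan shape:

```sql
-- ENABLED or FORCED means the session can use parallel DML/query;
-- DISABLED for pdml_status means the 'enable parallel dml' step was missed.
SELECT pq_status, pdml_status, pddl_status
FROM   v$session
WHERE  sid = SYS_CONTEXT('USERENV', 'SID');
```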


2 comments:

  1. Hi,
    have you tried other 'high throughput' approaches like use_merge?
    Regards
    G

  2. Hi G,
    Honestly...no. When the NL was giving me pretty decent performance I didn't look any further. I've very rarely used sort merge joins, to be honest. I'm not even certain in what circumstances the CBO uses them.....

    Cheers,
    Rich
