Yesterday I had an email from a developer asking me this:
"We're running an extract report and when we run it in plsql developer we get rows back immediately, however when we do a full extract of the entire dataset to a file it takes 2 hours - what's going on? "
(with the subtle hint, of course, that it's a database issue and not something they've done wrong).
I first explained that getting a few rows back in an IDE isn't a measure of the full workload - Oracle has just returned the first few rows quickly but hasn't yet done everything you asked of it - so you can't use that as a performance benchmark.
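To get a realistic timing from a client tool you have to force every row to be produced, not just the first screenful - for example (a hedged sketch in SQL*Plus, wrapping the query that appears later in this post):

```sql
-- Timing only stops once the whole result set has been produced.
-- Note: COUNT(*) can let the optimizer skip some column processing,
-- so treat this as an approximation of the real extract cost.
set timing on

select count(*)
from   (select /*+ no_merge */ *
        from   firm_group_view
        where  scenario_id = 11295295);
```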
However there is of course a real issue: this extract (which incidentally returns 3 million rows, so it's not that small) takes two hours, and that's a problem for the business. We need to get the run time down, so I need to dig a little deeper.
The actual query is just this:
select * from firm_group_view where scenario_id = 11295295
which looks very simple at first glance, though the word 'view' in the name probably hints that getting these results isn't quite so simple.
So looking at the view code, it's not overly complex: an 8-table join UNION ALL'd with another 8-table join. Most of the tables are small; only two big tables hold the large data volume.
Looking at the actual plan it's using (in SQL Monitoring here, so I get some real-time information on what it's doing and where), the plan actually looks OK - it just seems to be slow executing it. An extract from that is below.
Initially I thought that changing the join order with a LEADING hint - doing the big tables first and leaving the 'lookup' values until the end - might help, so I went ahead and did that. The plan changed as expected, but the performance was much the same - so what was going on?
Well, another look at the code revealed a function call in the statement - something that the explain plan (and indeed SQL Monitoring) does not factor in; it doesn't obviously show up at all. The function can be doing a huge amount of work, and that only becomes apparent by tracing or by looking at what other SQL the session is executing. The plan may appear to be slow on step 'X' when in fact that is just the stage at which the function is being executed.
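A couple of ways to catch this (a sketch using the v$ views and DBMS_MONITOR; :sid and :serial stand for the target session's identifiers):

```sql
-- What is the session actually executing right now? The recursive SQL
-- issued from inside the function shows up here, not in the main plan.
select s.sid, s.sql_id, q.sql_text
from   v$session s
       join v$sql q
         on  q.sql_id       = s.sql_id
         and q.child_number = s.sql_child_number
where  s.sid = :sid;

-- Or enable SQL trace on the session and run the trace file through tkprof:
exec dbms_monitor.session_trace_enable(session_id => :sid, serial_num => :serial, waits => true, binds => false);
```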
The function call in this case was:
select
CALCULATE_PHASE_IN_FACTOR(tls.tp_level_id,rmsd.timeslot_local,ms.scenario_date) as Calc_Factor,
So I tried commenting out the function and replacing it with a static value, just to see what happened:
select
1 as Calc_Factor,
Now when we run the query it's massively faster, confirming that the function is indeed the problem.
Whatever that function is doing, it runs 3 million times - so something only a tiny bit slow adds up to a lot of time (at, say, 2 ms per call, 3 million calls is 6,000 seconds - over an hour and a half).
So we're making progress - time to investigate the function.
Opening it up, something immediately jumps out (and this same construct is used twice in the same function...):
SELECT COUNT(*) INTO V_WORKING_DAYS
FROM (
SELECT rownum rnum
FROM all_objects
WHERE rownum <= I_CUT_OFF_DATE + 1 - ADD_MONTHS((LAST_DAY(I_CUT_OFF_DATE) + 1),-1)
)
WHERE to_char( ADD_MONTHS((LAST_DAY(I_CUT_OFF_DATE) + 1),-1) + rnum-1, 'DY' )
NOT IN ( 'SAT', 'SUN' );
what?
This is a really strange construct to see in application code, but I can see what they are doing - they just need something with a lot of rows to act as a 'dummy' table to generate values from.
In fact I googled this code, as I guessed the developers had borrowed it from somewhere, and found it in an AskTom question from 13 years ago...
What the developers are actually trying to do is work out the number of working days in a given month, for use in later calculations - this is not as trivial as it first sounds, and there are quite a few solutions for it posted on the internet.
For a 'quick win', and to avoid messing with the code too much, I decided to replace the use of all_objects with a dummy row generator (I'm not sure who came up with this trick originally, but I shamelessly borrowed the code from Jonathan's blog to save myself some time, having seen him use it at a recent Oracle Midlands event).
So the code now becomes:
with temparray as ( select
rownum id
from dual
connect by
level <= 50
)
SELECT COUNT(*) INTO V_WORKING_DAYS
FROM (
SELECT rownum rnum
FROM temparray
WHERE rownum <= LAST_DAY(I_CUT_OFF_DATE) + 1 - ADD_MONTHS((LAST_DAY(I_CUT_OFF_DATE) + 1),-1)
)
WHERE to_char( ADD_MONTHS((LAST_DAY(I_CUT_OFF_DATE) + 1),-1) + rnum-1, 'DY' )
NOT IN ( 'SAT', 'SUN' );
So instead of using all_objects (which has to do quite a bit of work across multiple dictionary tables), we now just make use of dual.
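You can see the difference by timing the two row sources on their own (a quick sketch; 31 rows is the most a month can need):

```sql
-- all_objects visits many data dictionary tables on every execution:
select count(*)
from   (select rownum rnum from all_objects where rownum <= 31);

-- the CONNECT BY generator touches nothing but the one-row dual table:
select count(*)
from   (select rownum rnum from dual connect by level <= 31);
```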
Run directly, this individual query is now more than ten times faster - and since the function runs it twice, the function as a whole should see a similar improvement.
I also considered RESULT_CACHE at this point, but the values passed to the function are rarely repeated, so it produced no tangible benefit.
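For completeness, here's what that would look like - a minimal sketch using a hypothetical WORKING_DAYS_IN_MONTH function (not the real CALCULATE_PHASE_IN_FACTOR) with the RESULT_CACHE clause; it only pays off when the same inputs recur:

```sql
-- Hypothetical example: Oracle caches the result per distinct input,
-- so repeated calls for the same month become near-free lookups.
create or replace function working_days_in_month (p_date in date)
  return number
  result_cache
is
  v_days number;
begin
  select count(*)
  into   v_days
  from   (select trunc(p_date, 'MM') + level - 1 as d
          from   dual
          connect by level <= last_day(p_date) - trunc(p_date, 'MM') + 1)
  -- the explicit NLS_DATE_LANGUAGE avoids the session-NLS dependence
  -- that a bare 'SAT'/'SUN' comparison has:
  where  to_char(d, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') not in ('SAT', 'SUN');
  return v_days;
end working_days_in_month;
/
```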
The extract process now completes in about 28 minutes.
It could still be improved by redesigning the function's logic and avoiding the double call to very similar processing, but larger changes like that require more effort and testing, and the gain is small relative to the effort. We've fixed a large part of the issue, and the payoff from going further is probably not worth it unless this is an absolutely time-critical report that can make a huge difference to a business process.
An interesting tuning case and a reminder that the explain plan is often not the whole story of what is going on.