Don't use the force Luke!



This week revealed another issue with our datawarehouse - this time with a statement that hadn't been a problem before but suddenly starting performing very badly (from maybe 10 minutes to 10 hours) .

So i set about investigating, an initial look showed the SQL was about 200 lines long...... however at second glance the statement was actually quite simple - it's simply a two table join with an aggregate - there are just loads of columns involved and the plan looks more complex as there is a lot of parallel processing going on.

The initial statement looks like this in cloud control


Which at first look doesn't seem that bad, however looking more closely reveals that the optimizer has got it badly wrong with its estimates from one of the tables - DA_STATIC_TRADE.

The index range scan only has a cost of 2 - looking at the data there is actually a huge amount of data in here and actually the table is partitioned by day and we are only interested on 1 day - and indeed most of that data in that day as even when applying the other filter it's still retrieving about 80% of the table.

What we actually want is two complete full table scans of the single partition (both tables involved are interval partitioned by day) followed by a hash join to give us the results back as quickly as possible (and with some parallel thrown in for good measure)

So what's going wrong?

My first though was stats, but a look through the table/index/histogram stats all looked good - so what on earth is going on?

Well a closer look at the cloud control screen shows this


If you notice what should be literals (as that's what is passed in to the query - and is generally desired 100% of the time in a warehouse as we want to give the optimizer as much information as possible)
actually seem to be binds......

Right.....

I know what the issue is here - cursor sharing - a quick check reveals it's been changed - it's set to FORCE.

Let's now do a quick test and switch it back to EXACT.

Now we get this (i.e. the literal shows correctly)



Which is exactly what we want - the costs look higher (i.e. its got them right) and we now get the correct plan - the statement is now back to 10 minutes.

It turned out the parameter was changed to try and resolve another issue and we quickly got it changed back.

A couple of lessons from this:

1) beware the impact of changing a global optimizer setting on a running system - you could bring it to its knees
2) Don't use the FORCE- use the EXACT!

And in fact the recommendation from Oracle themselves is to never use FORCE any more - it does have lots of nasty side effects - bad plans is just one of them.....


Comments