When upgrades go wrong - RMAN to the rescue (with some quirks)

We recently tried to patch an 11.2.0.2.2 database up to 11.2.0.3.7. Following the normal manual upgrade process, everything was going really well until it tried to upgrade APEX - at that point we hit this error:

Error creating page name="Page Zero" id="0" ORA-02291: integrity constraint
(APEX_040200.WWV_FLOW_STEP_UI_FK) violated - parent key not found


This seems to be a known issue with export/import in APEX 4. The fix is to 'manipulate' the "p_user_interface_id => " part of the PL/SQL call - however, in our case the upgrade process itself was generating the code, so we couldn't change it.
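For reference, the usual workaround in the notes we found is to point p_user_interface_id at an id that actually exists for the application. A minimal sketch of the sort of check involved, assuming (we haven't verified the names) that the parent table behind WWV_FLOW_STEP_UI_FK is WWV_FLOW_USER_INTERFACES:

-- sketch only: table and column names assumed from the constraint in the error,
-- not checked against a real APEX_040200 schema
select id
from   apex_040200.wwv_flow_user_interfaces
where  flow_id = :app_id;   -- :app_id = the application being imported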

So.....

We have to restore the whole 1TB database and roll forward through 1 week of archive logs to get us back to where we were at the start - so how do we do that.......?

First up, let's fire up RMAN and run what I think is needed as a first attempt:

run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
restore database;
recover database delete archivelog maxsize 150000M;
}

From past experience I know that restoring a lot of archive logs can fill up the filesystem and stop the restore, so I add the maxsize option to make RMAN delete them as it goes along and keep below the size of the archive area filesystem.
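As a rough sanity check before kicking it off, something like this (a sketch - the date is just 'about a week back' to match our recovery window) lists the archive log backups that will be coming back, which gives a feel for how much the restore is going to churn through:

# list archive log backups from roughly the start of the recovery window
LIST BACKUP OF ARCHIVELOG FROM TIME "to_date('06-AUG-2013','dd-mon-yyyy')";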

This all looked good and was left overnight - 12 hours later I came in to find:

Recovery Manager: Release 11.2.0.2.0 - Production on Tue Aug 13 17:52:27 2013

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ODIN2T (not mounted)
connected to recovery catalog database

RMAN> run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
restore database;
recover database delete archivelog maxsize 150000M;
} 2> 3> 4> 5>

executing command: SET until clause

Starting restore at 13-AUG-13
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=208 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: Data Protection for Oracle: version 5.4.1.0
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=291 device type=DISK

channel ORA_SBT_TAPE_1: starting datafile backup set restore
channel ORA_SBT_TAPE_1: specifying datafile(s) to restore from backup set


some log removed

channel ORA_SBT_TAPE_1: restored backup piece 1
channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:15
Finished restore at 14-AUG-13

Starting recover at 14-AUG-13
using channel ORA_SBT_TAPE_1
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 08/14/2013 05:30:32
RMAN-06079: database must be mounted to perform recovery

RMAN>


So - I can't recover as the database isn't mounted - fair enough (quite a few notes on Google fail to mention this part......).

So let's mount it:

RMAN> sql "alter database mount";


RMAN> sql "alter database mount";

sql statement: alter database mount
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of sql command on default channel at 08/14/2013 07:55:53
RMAN-11003: failure during parse/execution of SQL statement: alter database mount
ORA-00201: control file version 11.2.0.3.0 incompatible with ORACLE version 11.1.0.0.0
ORA-00202: control file: '/oracle/ODIN2T/oradata/control01.ctl'




Oh dear..... As part of the patch to 11.2.0.3 we realised compatible was set wrong and we changed it to 11.2. We then changed the value back to 11.1 before we did the restore, as the backups were taken at that compatibility and we weren't sure if it would have any 'funny' effects. The problem we have now is that compatible is 11.1 but the controlfile (which we didn't restore) is still in 11.2 format.
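Checking where compatible actually sits only takes a quick look from SQL*Plus - a minimal sketch:

-- current setting of the compatible parameter
show parameter compatible
-- or straight from the parameter view
select name, value from v$parameter where name = 'compatible';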

We decided to restore the controlfile rather than mess around with changing compatible again (though that might have worked, it seemed more risky).

So we restore the controlfile as below:

run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
restore controlfile;
}


Recovery Manager: Release 11.2.0.2.0 - Production on Wed Aug 14 07:58:16 2013

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

connected to target database: ODIN2T (not mounted)
connected to recovery catalog database

RMAN> run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
restore controlfile;
} 2> 3> 4>

executing command: SET until clause

Starting restore at 14-AUG-13
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=208 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: Data Protection for Oracle: version 5.4.1.0
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=291 device type=DISK

channel ORA_SBT_TAPE_1: starting datafile backup set restore
channel ORA_SBT_TAPE_1: restoring control file
channel ORA_SBT_TAPE_1: reading from backup piece cf_auto_c-1660885686-20130813-6e
channel ORA_SBT_TAPE_1: piece handle=cf_auto_c-1660885686-20130813-6e tag=TAG20130813T124543
channel ORA_SBT_TAPE_1: restored backup piece 1
channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:01:05
output file name=/oracle/ODIN2T/oradata/control01.ctl
output file name=/oracle/ODIN2T/origlog/control02.ctl
output file name=/oracle/ODIN2T/mirrlog/control03.ctl
Finished restore at 14-AUG-13

RMAN>




Now we try and mount:

RMAN>  sql "alter database mount";

sql statement: alter database mount
released channel: ORA_SBT_TAPE_1
released channel: ORA_DISK_1

RMAN>

There was then a really long pause and no confirmation it was mounted (though the prompt came back). So did it work?
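A quicker way to check than guessing (a sketch, from a SQL*Plus session on the same instance):

-- v$instance shows MOUNTED once the mount has completed
select status from v$instance;

We just ran the mount again instead: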

RMAN>  sql "alter database mount";

sql statement: alter database mount
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of sql command on default channel at 08/14/2013 08:03:28
RMAN-11003: failure during parse/execution of SQL statement: alter database mount
ORA-01100: database already mounted

RMAN>


Yes it did - OK, now let's try the recover again now that we are mounted.

RMAN> run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
recover database delete archivelog maxsize 150000M;
}  2> 3> 4>

executing command: SET until clause

Starting recover at 14-AUG-13
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: SID=208 device type=SBT_TAPE
channel ORA_SBT_TAPE_1: Data Protection for Oracle: version 5.4.1.0
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=291 device type=DISK

starting media recovery

channel ORA_SBT_TAPE_1: starting archived log restore to default destination
channel ORA_SBT_TAPE_1: restoring archived log
archived log thread=1 sequence=94899
channel ORA_SBT_TAPE_1: restoring archived log
archived log thread=1 sequence=94900
channel ORA_SBT_TAPE_1: restoring archived log
archived log thread=1 sequence=94901
channel ORA_SBT_TAPE_1: restoring archived log
archived log thread=1 sequence=94902
channel ORA_SBT_TAPE_1: restoring archived log
archived log thread=1 sequence=94903
channel ORA_SBT_TAPE_1: reading from backup piece redolog_ODIN2T_822790837_20130807_69638_1.rman
channel ORA_SBT_TAPE_1: piece handle=redolog_ODIN2T_822790837_20130807_69638_1.rman tag=TAG20130807T010036



OK, looking good - so now we wait...

And it's not long before we get this:

archived log file name=/oracle/ODIN2T/oraarch/1_94909_739557160.arc RECID=94206 STAMP=823421802
failover to previous backup
unable to find archived log
archived log thread=1 sequence=94910
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 08/14/2013 08:18:46
RMAN-06055: could not find archived log with sequence 94910 for thread 1


At this point (after a change of trousers) we try again....and amazingly it works!
This pattern repeated at random intervals throughout the restore process - it seems that at certain points our tape library is busy, and rather than RMAN signalling that as the reason, it just says sorry, 'file not found' - which is not at all nice....
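When RMAN-06055 pops up, a quick way to confirm the 'missing' sequence really is in a backup before blaming anything else is something like this (a sketch, using the sequence from the error above):

# list the backup pieces containing that archive log sequence
LIST BACKUP OF ARCHIVELOG SEQUENCE 94910 THREAD 1;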

We had one other weird error during the restore:

channel ORA_SBT_TAPE_1: restore complete, elapsed time: 00:00:03
archived log file name=/oracle/ODIN2T/oraarch/1_95359_739557160.arc thread=1 sequence=95359
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 08/14/2013 13:32:06
RMAN-06561: available space must be larger than 336612 kb



This is, I guess, related to the maxsize parameter setting I had - it must have been keeping track of how much space it thought it had been using and decided it was out of space - very odd. Rerunning the recover and it just carried on.....
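If the rerun hadn't just carried on, one option (a sketch only - we didn't need it) would have been to drop the maxsize clause so the space accounting couldn't trip again, and tidy up the archive area by hand afterwards:

# same recovery, just without the maxsize limit
run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
recover database delete archivelog;
}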


media recovery complete, elapsed time: 00:00:03
Finished recover at 14-AUG-13

RMAN>



Hooray - now we open with resetlogs, as we have done an incomplete recovery to just before the point in time at which we started the upgrade:




RMAN> alter database open resetlogs;

database opened
new incarnation of database registered in recovery catalog
starting full resync of recovery catalog
full resync complete

RMAN>

And there we go..... all done

So the command at the start should have been:


run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
restore controlfile;
restore database;
sql "alter database mount";
recover database delete archivelog maxsize 150000M;
}

In the absence of tape problems and weird RMAN quirks this would have done it all in one command.....
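For completeness, a sketch of the whole thing run end to end from a shut-down instance - the startup nomount at the front and the resetlogs open at the end are the only extra pieces around it:

startup nomount;
run {
SET UNTIL TIME "to_date('13-AUG-2013 13:00:00','dd-mon-yyyy hh24:mi:ss')";
restore controlfile;
restore database;
sql "alter database mount";
recover database delete archivelog maxsize 150000M;
}
alter database open resetlogs;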

Now we just have to work out how to get round the APEX upgrade issue.....
