Now before I tell this story i want to make clear that i still dont 100% understand how this thing got fixed..... There are a number of bizarre errors along the way when normally we'd of just switched off the life support machine but we persevered with it anyway and somehow it came back....
See if you can work out how it actually managed to work as I'm still a little confused.
Ok so here goes - what we wanted to do was flashback a database to a previous guaranteed restore point. Ths database however is greatly abused by the development team who flashback and forth seemingly for fun - there are almost a thousand incarnations of it recorded in the rman catalog.
Anyway thats a litle background - the db was 11.2.0.3 but was patched to 11.2.0.4 in the past couple of weeks.
The restore point was before the patching was done - but that doesnt really seem part of the issue (apart from a minot thing right at the end).
So first off the developers couldnt flash back (whcih has happened before), but previously running the command from rman rather than sqlplus seems to have always resolved the issue - however this time it did not
They were getting
[oracle@server]:DB:/oracle/home/oracle# oerr ora 38754
38754, 00000, "FLASHBACK DATABASE not started; required redo log is not available"
// *Cause: A FLASHBACK DATABASE command did not start. A redo log needed for
// the recovery part of FLASHBACK DATABASE could not be found or
// accessed.
// *Action: See trace file
Lets try it for ourselves
RMAN> flashback database to restore point V_2_6_8_DONE
2> ;
Starting flashback at 24-JAN-14
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=573 device type=DISK
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 15:18:43
ORA-00600: internal error code, arguments: [krbrfbaf_inv_fno], [9], [], [], [], [], [], [], [], [], [], []
RMAN> exit
Nice error - not seem that one before
Lets try in sqlplus
SQL>
flashback database to restore point V_2_6_8_DONE
*
ERROR at line 1:
ORA-38754: FLASHBACK DATABASE not started; required redo log is not available
ORA-38762: redo logs needed for SCN 1247576 to SCN 1258352
ORA-38761: redo log sequence 1 in thread 1, incarnation 147 could not be
accessed
OK - a more meaningful error - lets see what incarnation 147 is
RMAN> list incarnation of database;
we get loads of output (the thousand i was talking about) and there is no 147 - what? that makes no sense
does it mean the 147th incarantion rather than incarnation number 147 (surely not) - but lets check anyway. Turns out thats over 9months old so it cant be that.
Lets just try a restore command of the archivelogs for those scn's
RMAN> restore archivelog from scn 1247576 until scn 1258352;
That spews out a load of errors, 50% can;t be found in the catalog and 50% it complains tape channel isnt allocated. So even if we allocate tape thats not going to help.
Lets have a look at v$restore_point as I'm still not happy about this incarantion 147 thing
SCN DATABASE_INCARNATION# GUARANTEE STORAGE_SIZE TIME RESTORE_POINT_TIME PRESERVED NAME
---------- --------------------- --------- ------------ --------------------------------------------------------------------------- --------------------------------------------------------------------------- --------- --------------------------------------------------------------------------------------------------------------------------------
920045 1 YES 104857600 08-NOV-13 10.31.48.000000000 AM YES V_2_6_1_DONE
936699 4 YES 2516582400 08-NOV-13 05.18.29.000000000 PM YES V_2_5_65_DONE
944383 7 YES 0 11-NOV-13 12.45.27.000000000 PM YES V_2_5_65_EBLD_DONE
944814 8 YES 943718400 11-NOV-13 01.09.11.000000000 PM YES V_2_5_66_DONE
953980 21 YES 419430400 12-NOV-13 10.12.03.000000000 AM YES V_2_5_67_DONE
1182226 22 YES 2306867200 12-NOV-13 03.08.06.000000000 PM YES V_2_5_67_JVM_EBLD_DONE
1192568 41 YES 7864320000 14-NOV-13 01.42.24.000000000 PM YES V_2_5_68_DONE
1202400 58 YES 1.1429E+10 19-NOV-13 10.53.09.000000000 AM YES V_2_5_70_DONE
1214152 101 YES 3250585600 28-NOV-13 10.27.45.000000000 AM YES V_2_6_4_DONE
1226880 125 YES 943718400 29-NOV-13 12.57.29.000000000 PM YES V_2_6_5_DONE
1237237 129 YES 2411724800 29-NOV-13 01.47.54.000000000 PM YES V_2_6_6_DONE
1247575 134 YES 2831155200 04-DEC-13 02.40.40.000000000 PM YES V_2_6_7_DONE
1258352 147 YES 2.4432E+10 09-DEC-13 02.48.46.000000000 PM YES V_2_6_8_DONE
5415174 162 YES 629145600 22-JAN-14 10.20.17.000000000 PM YES V_2_6_9_DONE
14 rows selected.
So all the rows seem to be from very low incarnation numbers - this is very odd - none of them correspond to anything shown in the list incarnation output from rman.
Hmm what now - lets see what rman thinks about restore points(discovered this by accident when trying to fix this issue)
RMAN> list restore point all;
SCN RSP Time Type Time Name
---------------- --------- ---------- --------- ----
920045 GUARANTEED 08-NOV-13 V_2_6_1_DONE
936699 GUARANTEED 08-NOV-13 V_2_5_65_DONE
944383 GUARANTEED 11-NOV-13 V_2_5_65_EBLD_DONE
944814 GUARANTEED 11-NOV-13 V_2_5_66_DONE
953980 GUARANTEED 11-NOV-13 V_2_5_67_DONE
1182226 GUARANTEED 12-NOV-13 V_2_5_67_JVM_EBLD_DONE
1192568 GUARANTEED 13-NOV-13 V_2_5_68_DONE
1202400 GUARANTEED 15-NOV-13 V_2_5_70_DONE
1214152 GUARANTEED 27-NOV-13 V_2_6_4_DONE
1226880 GUARANTEED 29-NOV-13 V_2_6_5_DONE
1237237 GUARANTEED 29-NOV-13 V_2_6_6_DONE
1247575 GUARANTEED 04-DEC-13 V_2_6_7_DONE
1258352 GUARANTEED 09-DEC-13 V_2_6_8_DONE
5415174 GUARANTEED 22-JAN-14 V_2_6_9_DONE
what the hell lets try again
RMAN> flashback database to restore point V_2_6_8_DONE;
Starting flashback at 24-JAN-14
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 16:01:58
RMAN-00600: internal error, arguments [8714] [] [] [] []
RMAN> exit
new error!
lets try in sqlplus
SQL> flashback database to restore point V_2_6_8_DONE;
flashback database to restore point V_2_6_8_DONE
*
ERROR at line 1:
ORA-38754: FLASHBACK DATABASE not started; required redo log is not available
ORA-38762: redo logs needed for SCN 1247576 to SCN 1258352
ORA-38761: redo log sequence 1 in thread 1, incarnation 147 could not be
accessed
Thats still the same
lets try rman but allocate a tape channel to try and get the archivelogs back it wants
RMAN> run
2> {allocate channel t1 type sbt_tape;
3> restore archivelog from scn 1247576 until scn 1258352;
4> }
released channel: ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of allocate command on t1 channel at 01/24/2014 16:21:19
ORA-19554: error allocating device, device type: SBT_TAPE, device name:
ORA-27000: skgfqsbi: failed to initialize storage subsystem (SBT) layer
Linux-x86_64 Error: 2534: Unknown system error
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
SBT error = 7011, errno = 2534, sbtopen: system error
RMAN> exit
ah - forgot tdp config file
RMAN> run {
2> allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> restore archivelog from scn 1247576 until scn 1258352;
4> }
released channel: ORA_DISK_1
allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0
Starting restore at 24-JAN-14
released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 01/24/2014 16:23:07
RMAN-06026: some targets not found - aborting restore
nope thats going nowhere fast
maybe due to all the messing around they did some arch logs are only on disk and not 'recorded' in the recovery area
RMAN> catalog recovery area;
this finds some stuff but nothing of any use
hmm - lets list incarnation again
RMAN> list incarnation;
using target database control file instead of recovery catalog
List of Database Incarnations
and 147 is there
hold on - what just happened there?
Seems the incarnation information in the controlfile is different to the rman catalog - i still dont understand how this happened - it must be something to do with the initial flashback command i think that somehow did something to the controlfile - anyway this is sort of progress.
Lets try a restore without a catalog connection
RMAN> run {
2> allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> flashback database to restore point V_2_6_8_DONE;
4> }
released channel: ORA_DISK_1
allocated channel: t1
channel t1: SID=1 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0
Starting flashback at 24-JAN-14
starting media recovery
unable to find archived log
archived log thread=1 sequence=1
released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 16:47:15
RMAN-06054: media recovery requesting unknown archived log for thread 1 with sequence 1 and starting SCN of 1247577
different - but healthier than the last errors
lets try resetting to incarnation 147 and then running it
RMAN> reset database to incarnation 147;
database reset to incarnation 147
RMAN> run {
2> allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> flashback database to restore point V_2_6_8_DONE;
4> }
allocated channel: t1
channel t1: SID=1 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0
Starting flashback at 24-JAN-14
starting media recovery
unable to find archived log
archived log thread=1 sequence=1
released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 16:48:58
RMAN-06054: media recovery requesting unknown archived log for thread 1 with sequence 1 and starting SCN of 1247577
RMAN> exit
nope - still same thing
So now I'm thinking the catalog seems to know about the archivelog backups but not the incarnations and the controlfile knows about the incarnations but not the archivelogs - so what to do....
I can see that it wants logseq 1 from one of the incarnations that exist around a time period - there are 5 of them.
Lets just restore logfile 1 from all of them and see what happens!
so i do this 5 times with different id's
RMAN> reset database to incarnation 919071350;
database reset to incarnation 919071350
RMAN> run{
2> allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> restore archivelog from logseq 1 until logseq 1;
4> }
released channel: ORA_DISK_1
allocated channel: t1
channel t1: SID=1 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0
Starting restore at 24-JAN-14
channel t1: starting archived log restore to default destination
channel t1: restoring archived log
archived log thread=1 sequence=1
channel t1: reading from backup piece redolog_EDERBLD_833727641_20131209_658_1.rman
channel t1: piece handle=redolog_EDERBLD_833727641_20131209_658_1.rman tag=TAG20131209T150039
channel t1: restored backup piece 1
channel t1: restore complete, elapsed time: 00:01:45
Finished restore at 24-JAN-14
released channel: t1
Ok - thats all done - lets try again
RMAN> run{
2> allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> flashback database to restore point V_2_6_8_DONE;}
using target database control file instead of recovery catalog
allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0
Starting flashback at 24-JAN-14
starting media recovery
media recovery failed
released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 17:05:19
ORA-00283: recovery session canceled due to errors
ORA-38770: FLASHBACK DATABASE failed during recovery.
ORA-19909: datafile 1 belongs to an orphan incarnation
ORA-01110: data file 1: '/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_system_97l4vng9_.dbf'
RMAN> exit
WHAT! - anyway it's different......
that was without catalog connection - lets try with
RMAN> run{
2> allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> flashback database to restore point V_2_6_8_DONE;}
allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0
Starting flashback at 24-JAN-14
starting media recovery
media recovery complete, elapsed time: 00:00:01
channel t1: starting archived log restore to default destination
channel t1: restoring archived log
archived log thread=1 sequence=1
channel t1: reading from backup piece redolog_EDERBLD_833297422_20131204_367_1.rman
channel t1: ORA-19870: error while restoring backup piece redolog_EDERBLD_833297422_20131204_367_2.rman
ORA-19507: failed to retrieve sequential file, handle="redolog_EDERBLD_833297422_20131204_367_2.rman", parms=""
ORA-27029: skgfrtrv: sbtrestore returned error
ORA-19511: Error received from media manager layer, error text:
ANU2614E Invalid sequence of function calls to Data Protection for Oracle
Finished flashback at 24-JAN-14
released channel: t1
hmm - better
some tape issue - lets just try again
RMAN> run{
2> allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> flashback database to restore point V_2_6_8_DONE;}
allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0
Starting flashback at 24-JAN-14
starting media recovery
media recovery complete, elapsed time: 00:00:07
Finished flashback at 24-JAN-14
released channel: t1
Hold on - no errors that seemed to have worked!
lets check checkpoint numbers on the datafiles
SQL> col xx format 99999999999999999999
SQL> select name,CHECKPOINT_CHANGE# xx from v$datafile
2 /
NAME
--------------------------------------------------------------------------------
XX
---------------------
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_system_97l4vng9_.dbf
1258353
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_sysaux_97l4vp6z_.dbf
1258353
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_sys_undo_97l4vq97_.dbf
1258353
NAME
--------------------------------------------------------------------------------
XX
---------------------
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_data_97l5rngl_.dbf
1258353
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_inte_97l5romo_.dbf
1258353
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_stag_97l5rpvw_.dbf
1258353
NAME
--------------------------------------------------------------------------------
XX
---------------------
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_sap_bods_97l5rqy4_.dbf
1258353
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_test_97l5rs21_.dbf
1258353
8 rows selected.
They all seem to be in sync
- lets try and open it then....
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open
Ok fair enough - resetlogs is needed after flashback
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-39700: database must be opened with UPGRADE option
Process ID: 26548
Session ID: 495 Serial number: 123
Come on......
Now we are back to the version thing - whcih should be a simple fix we just revert to 11.2.0.3 and open it from there..
alter database open resetlogs
*
ERROR at line 1:
ORA-01139: RESETLOGS option only valid after an incomplete database recovery
*
ERROR at line 1:
ORA-01139: RESETLOGS option only valid after an incomplete database recovery
hmm - lets try without resetlogs
SYS@DB>alter database open;
Database altered.
SYS@DB>
SYS@DB>alter database open;
Database altered.
SYS@DB>
and it only goes and works!
This last bit is explainable - the resetlogs at 11.2.0.4 worked but it couldnt open the database, reverting to 11.2.0.3 and the resetlogs wasnt needed and the database just opened.
The earlier part is still a bit of a mystery and it was just guesswork really to try and restore a random set of archive logs. Somehow the controlfile and the recovery catalog had very different ideas of what incarnations were and what id's it was using for both.
However - I don't really feel that comfortable about giving this back - we are going to see what the developers want to do and maybe just restore from a standard rman backup to an appropriate time.
Maybe this helps someone with a similar bizarre scenario and it also shows the importance of sticking with a problem and trying things out when you have nothing to lose.
We've now backed up what we have and we'll wait for the developers to come back......
Comments
Post a Comment