The tale of the database that came back from the dead......



Now before I tell this story i want to make clear that i still dont 100% understand how this thing got fixed..... There are a number of bizarre errors along the way when normally we'd of just switched off the life support machine but we persevered with it anyway and somehow it came back....

See if you can work out how it actually managed to work as I'm still a little confused.

Ok so here goes - what we wanted to do was flashback a database to a previous guaranteed restore point. Ths database however is greatly abused by the development team who flashback and forth seemingly for fun - there are almost a thousand incarnations of it recorded in the rman catalog.

Anyway thats a litle background - the db was 11.2.0.3 but was patched to 11.2.0.4 in the past couple of weeks.

The restore point was before the patching was done - but that doesnt really seem part of the issue (apart from a minot thing right at the end).

So first off the developers couldnt flash back (whcih has happened before), but previously running the command from rman rather than sqlplus seems to have always resolved the issue - however this time it did not

They were getting

[oracle@server]:DB:/oracle/home/oracle# oerr ora 38754
38754, 00000, "FLASHBACK DATABASE not started; required redo log is not available"
// *Cause: A FLASHBACK DATABASE command did not start.  A redo log needed for
//         the recovery part of FLASHBACK DATABASE could not be found or
//         accessed.
// *Action: See trace file


Lets try it for ourselves

RMAN> flashback database to restore point V_2_6_8_DONE
2> ;

Starting flashback at 24-JAN-14
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=573 device type=DISK

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 15:18:43
ORA-00600: internal error code, arguments: [krbrfbaf_inv_fno], [9], [], [], [], [], [], [], [], [], [], []

RMAN> exit

Nice error - not seem that one before

Lets try in sqlplus

SQL>
flashback database to restore point V_2_6_8_DONE
*
ERROR at line 1:
ORA-38754: FLASHBACK DATABASE not started; required redo log is not available
ORA-38762: redo logs needed for SCN 1247576 to SCN 1258352
ORA-38761: redo log sequence 1 in thread 1, incarnation 147 could not be
accessed


OK - a more meaningful error - lets see what incarnation 147 is

RMAN> list incarnation of database;

we get loads of output (the thousand i was talking about) and there is no 147 - what? that makes no sense

does it mean the 147th incarantion rather than incarnation number 147 (surely not) - but lets check anyway. Turns out thats over 9months old so it cant be that.

Lets just try a restore command of the archivelogs for those scn's

RMAN> restore archivelog from scn 1247576 until scn 1258352;

That spews out a load of errors, 50% can;t be found in the catalog and 50% it complains tape channel isnt allocated. So even if we allocate tape thats not going to help.

Lets have a look at v$restore_point as I'm still not happy about this incarantion 147 thing

       SCN DATABASE_INCARNATION# GUARANTEE STORAGE_SIZE TIME                                                                        RESTORE_POINT_TIME      PRESERVED NAME
---------- --------------------- --------- ------------ --------------------------------------------------------------------------- --------------------------------------------------------------------------- --------- --------------------------------------------------------------------------------------------------------------------------------
    920045                     1 YES          104857600 08-NOV-13 10.31.48.000000000 AM                                                                     YES        V_2_6_1_DONE
    936699                     4 YES         2516582400 08-NOV-13 05.18.29.000000000 PM                                                                     YES        V_2_5_65_DONE
    944383                     7 YES                  0 11-NOV-13 12.45.27.000000000 PM                                                                     YES        V_2_5_65_EBLD_DONE
    944814                     8 YES          943718400 11-NOV-13 01.09.11.000000000 PM                                                                     YES        V_2_5_66_DONE
    953980                    21 YES          419430400 12-NOV-13 10.12.03.000000000 AM                                                                     YES        V_2_5_67_DONE
   1182226                    22 YES         2306867200 12-NOV-13 03.08.06.000000000 PM                                                                     YES        V_2_5_67_JVM_EBLD_DONE
   1192568                    41 YES         7864320000 14-NOV-13 01.42.24.000000000 PM                                                                     YES        V_2_5_68_DONE
   1202400                    58 YES         1.1429E+10 19-NOV-13 10.53.09.000000000 AM                                                                     YES        V_2_5_70_DONE
   1214152                   101 YES         3250585600 28-NOV-13 10.27.45.000000000 AM                                                                     YES        V_2_6_4_DONE
   1226880                   125 YES          943718400 29-NOV-13 12.57.29.000000000 PM                                                                     YES        V_2_6_5_DONE
   1237237                   129 YES         2411724800 29-NOV-13 01.47.54.000000000 PM                                                                     YES        V_2_6_6_DONE
   1247575                   134 YES         2831155200 04-DEC-13 02.40.40.000000000 PM                                                                     YES        V_2_6_7_DONE
   1258352                   147 YES         2.4432E+10 09-DEC-13 02.48.46.000000000 PM                                                                     YES        V_2_6_8_DONE
   5415174                   162 YES          629145600 22-JAN-14 10.20.17.000000000 PM                                                                     YES        V_2_6_9_DONE

14 rows selected.

So all the rows seem to be from very low incarnation numbers - this is very odd - none of them correspond to anything shown in the list incarnation output from rman.

Hmm what now - lets see what rman thinks about restore points(discovered this by accident when trying to fix this issue)


RMAN> list restore point all;

SCN              RSP Time  Type       Time      Name
---------------- --------- ---------- --------- ----
920045                     GUARANTEED 08-NOV-13 V_2_6_1_DONE
936699                     GUARANTEED 08-NOV-13 V_2_5_65_DONE
944383                     GUARANTEED 11-NOV-13 V_2_5_65_EBLD_DONE
944814                     GUARANTEED 11-NOV-13 V_2_5_66_DONE
953980                     GUARANTEED 11-NOV-13 V_2_5_67_DONE
1182226                    GUARANTEED 12-NOV-13 V_2_5_67_JVM_EBLD_DONE
1192568                    GUARANTEED 13-NOV-13 V_2_5_68_DONE
1202400                    GUARANTEED 15-NOV-13 V_2_5_70_DONE
1214152                    GUARANTEED 27-NOV-13 V_2_6_4_DONE
1226880                    GUARANTEED 29-NOV-13 V_2_6_5_DONE
1237237                    GUARANTEED 29-NOV-13 V_2_6_6_DONE
1247575                    GUARANTEED 04-DEC-13 V_2_6_7_DONE
1258352                    GUARANTEED 09-DEC-13 V_2_6_8_DONE
5415174                    GUARANTEED 22-JAN-14 V_2_6_9_DONE

what the hell lets try again

RMAN> flashback database to restore point V_2_6_8_DONE;

Starting flashback at 24-JAN-14
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 16:01:58
RMAN-00600: internal error, arguments [8714] [] [] [] []

RMAN> exit

new error!

lets try in sqlplus

SQL> flashback database to restore point V_2_6_8_DONE;
flashback database to restore point V_2_6_8_DONE
*
ERROR at line 1:
ORA-38754: FLASHBACK DATABASE not started; required redo log is not available
ORA-38762: redo logs needed for SCN 1247576 to SCN 1258352
ORA-38761: redo log sequence 1 in thread 1, incarnation 147 could not be
accessed

Thats still the same

lets try rman but allocate a tape channel to try and get the archivelogs back it wants

RMAN> run
2> {allocate channel t1 type sbt_tape;
3>  restore archivelog from scn 1247576 until scn 1258352;
4> }

released channel: ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of allocate command on t1 channel at 01/24/2014 16:21:19
ORA-19554: error allocating device, device type: SBT_TAPE, device name:
ORA-27000: skgfqsbi: failed to initialize storage subsystem (SBT) layer
Linux-x86_64 Error: 2534: Unknown system error
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
   SBT error = 7011, errno = 2534, sbtopen: system error

RMAN> exit

ah - forgot tdp config file


RMAN> run {
2>  allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> restore archivelog from scn 1247576 until scn 1258352;
4> }

released channel: ORA_DISK_1
allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0

Starting restore at 24-JAN-14

released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 01/24/2014 16:23:07
RMAN-06026: some targets not found - aborting restore

nope thats going nowhere fast

maybe due to all the messing around they did some arch logs are only on disk and not 'recorded' in the recovery area

RMAN> catalog recovery area;

this finds some stuff but nothing of any use

hmm - lets list incarnation again

RMAN> list incarnation;

using target database control file instead of recovery catalog

List of Database Incarnations

and 147 is there

hold on - what just happened there?

Seems the incarnation information in the controlfile is different to the rman catalog - i still dont understand how this happened - it must be something to do with the initial flashback command i think that somehow did something to the controlfile - anyway this is sort of progress.

Lets try a restore without a catalog connection

RMAN> run {
2>  allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3>  flashback database to restore point V_2_6_8_DONE;
4> }

released channel: ORA_DISK_1
allocated channel: t1
channel t1: SID=1 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0

Starting flashback at 24-JAN-14


starting media recovery

unable to find archived log
archived log thread=1 sequence=1
released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 16:47:15
RMAN-06054: media recovery requesting unknown archived log for thread 1 with sequence 1 and starting SCN of 1247577

different - but healthier than the last errors

lets try resetting to incarnation 147 and then running it

RMAN> reset database to incarnation 147;

database reset to incarnation 147

RMAN> run {
2>  allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3>  flashback database to restore point V_2_6_8_DONE;
4> }

allocated channel: t1
channel t1: SID=1 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0

Starting flashback at 24-JAN-14


starting media recovery

unable to find archived log
archived log thread=1 sequence=1
released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 16:48:58
RMAN-06054: media recovery requesting unknown archived log for thread 1 with sequence 1 and starting SCN of 1247577

RMAN> exit

nope - still same thing

So now I'm thinking the catalog seems to know about the archivelog backups but not the incarnations and the controlfile knows about the incarnations but not the archivelogs - so what to do....

I can see that it wants logseq 1 from one of the incarnations that exist around a time period - there are 5 of them.

Lets just restore logfile 1 from all of them and see what happens!

so i do this 5 times with different id's


RMAN>  reset database to incarnation 919071350;

database reset to incarnation 919071350



RMAN> run{
2>  allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3> restore archivelog from logseq 1 until logseq 1;
4> }

released channel: ORA_DISK_1
allocated channel: t1
channel t1: SID=1 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0

Starting restore at 24-JAN-14

channel t1: starting archived log restore to default destination
channel t1: restoring archived log
archived log thread=1 sequence=1
channel t1: reading from backup piece redolog_EDERBLD_833727641_20131209_658_1.rman
channel t1: piece handle=redolog_EDERBLD_833727641_20131209_658_1.rman tag=TAG20131209T150039
channel t1: restored backup piece 1
channel t1: restore complete, elapsed time: 00:01:45
Finished restore at 24-JAN-14
released channel: t1

Ok - thats all done - lets try again


RMAN> run{
2>  allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3>  flashback database to restore point V_2_6_8_DONE;}

using target database control file instead of recovery catalog
allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0

Starting flashback at 24-JAN-14


starting media recovery
media recovery failed
released channel: t1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of flashback command at 01/24/2014 17:05:19
ORA-00283: recovery session canceled due to errors
ORA-38770: FLASHBACK DATABASE failed during recovery.
ORA-19909: datafile 1 belongs to an orphan incarnation
ORA-01110: data file 1: '/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_system_97l4vng9_.dbf'

RMAN> exit

WHAT! - anyway it's different......

that was without catalog connection - lets try with


RMAN> run{
2>   allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3>  flashback database to restore point V_2_6_8_DONE;}

allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0

Starting flashback at 24-JAN-14


starting media recovery

media recovery complete, elapsed time: 00:00:01
channel t1: starting archived log restore to default destination
channel t1: restoring archived log
archived log thread=1 sequence=1
channel t1: reading from backup piece redolog_EDERBLD_833297422_20131204_367_1.rman
channel t1: ORA-19870: error while restoring backup piece redolog_EDERBLD_833297422_20131204_367_2.rman
ORA-19507: failed to retrieve sequential file, handle="redolog_EDERBLD_833297422_20131204_367_2.rman", parms=""
ORA-27029: skgfrtrv: sbtrestore returned error
ORA-19511: Error received from media manager layer, error text:
   ANU2614E Invalid sequence of function calls to Data Protection for Oracle

Finished flashback at 24-JAN-14
released channel: t1

hmm - better

some tape issue - lets just try again


RMAN>  run{
2>   allocate channel t1 type 'sbt_tape' parms 'ENV=(TDPO_OPTFILE=/tsm/EDERBLD/conf/tdpo.opt)';
3>  flashback database to restore point V_2_6_8_DONE;}

allocated channel: t1
channel t1: SID=573 device type=SBT_TAPE
channel t1: Data Protection for Oracle: version 5.5.2.0

Starting flashback at 24-JAN-14


starting media recovery
media recovery complete, elapsed time: 00:00:07

Finished flashback at 24-JAN-14
released channel: t1


Hold on - no errors that seemed to have worked!

lets check checkpoint numbers on the datafiles

SQL> col xx format 99999999999999999999
SQL> select name,CHECKPOINT_CHANGE# xx from v$datafile
  2  /

NAME
--------------------------------------------------------------------------------
                   XX
---------------------
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_system_97l4vng9_.dbf
              1258353

/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_sysaux_97l4vp6z_.dbf
              1258353

/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_sys_undo_97l4vq97_.dbf
              1258353


NAME
--------------------------------------------------------------------------------
                   XX
---------------------
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_data_97l5rngl_.dbf
              1258353

/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_inte_97l5romo_.dbf
              1258353

/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_stag_97l5rpvw_.dbf
              1258353


NAME
--------------------------------------------------------------------------------
                   XX
---------------------
/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_sap_bods_97l5rqy4_.dbf
              1258353

/oracle/EDER/oradata/EDERBLD/datafile/o1_mf_dwh_test_97l5rs21_.dbf
              1258353


8 rows selected.
They all seem to be in sync

- lets try and open it then....

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open

Ok fair enough - resetlogs is needed after flashback

SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-39700: database must be opened with UPGRADE option
Process ID: 26548
Session ID: 495 Serial number: 123


Come on......

Now we are back to the version thing - whcih should be a simple fix we just revert to 11.2.0.3 and open it from there..

alter database open resetlogs
*
ERROR at line 1:
ORA-01139: RESETLOGS option only valid after an incomplete database recovery

hmm - lets try without resetlogs

SYS@DB>alter database open;

Database altered.

SYS@DB>

 
and it only goes and works!

This last bit is explainable - the resetlogs at 11.2.0.4 worked but it couldnt open the database, reverting to 11.2.0.3 and the resetlogs wasnt needed and the database just opened.

The earlier part is still a bit of a mystery and it was just guesswork really to try and restore a random set of archive logs. Somehow the controlfile and the recovery catalog had very different ideas of what incarnations were and what id's it was using for both.

However - I don't really feel that comfortable about giving this back - we are going to see what the developers want to do and maybe just restore from a standard rman backup to an appropriate time.

Maybe this helps someone with a similar bizarre scenario and it also shows the importance of sticking with a problem and trying things out when you have nothing to lose.

We've now backed up what we have and we'll wait for the developers to come back......




0 comments:

Post a Comment