When flashback causes block corruption…

 


 

We had some ‘fun’ yesterday with the fallout from doing a flashback on our test data warehouse. Flashback is used heavily on these systems and had never really caused an issue before, until what happened yesterday came to light.

Basically, what happened was that after the flashback the application was getting errors (the dreaded ORA-01578) saying that data blocks had been loaded with the nologging option. Initially you think "what?", then you realise that when Oracle does a flashback it doesn't just use the flashback logs: it uses them to do some of the work, but it then has to apply information from the archive logs to complete the flashback. You can essentially think of a flashback to a restore point as going to some point before the actual restore point and then rolling forward from there. Anything that was not logged within that small media-recovery window results in the logical block corruption above.

In our case the nologging had been caused by the developers using the +APPEND hint (the one that everyone forgets causes it to be nologging).
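For reference, the kind of load that bit us looks something like this (the table names here are made up). Note that APPEND forces a direct-path load, but whether redo is actually generated also depends on the table's LOGGING attribute and the archivelog mode, as the comment below spells out:

```sql
-- Hypothetical example: direct-path load via the APPEND hint.
-- On a NOLOGGING table in archivelog mode this generates minimal redo,
-- so media recovery (and hence flashback's roll-forward) cannot rebuild
-- the loaded blocks.
INSERT /*+ APPEND */ INTO dwh_staging_tab
SELECT * FROM source_tab;
COMMIT;
```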

To identify how much had been affected, we initially ran dbv, which ran surprisingly quickly for a very large (2TB) database. This flagged up a lot of blocks, but converting these into actual objects would have been painful: dbv does not update v$database_block_corruption, so we couldn't simply run a query to find the problem objects. However, dbv had shown the corruption was confined to one tablespace, so we could run a quick RMAN validate to check it:

[oracle@server]:EIER:/tsm/EIER/log# rman target=/

Recovery Manager: Release 11.2.0.3.0 - Production on Wed Nov 6 19:11:42 2013

Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.

connected to target database: EIER (DBID=1833381966)

RMAN> validate check logical tablespace DWH_STAGING;

Starting validate at 06-NOV-13
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1219 device type=DISK
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
input datafile file number=00005 name=/oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_83w4tdy4_.dbf
input datafile file number=00008 name=/oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_856br43g_.dbf
input datafile file number=00042 name=/oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_92twl0d0_.dbf
input datafile file number=00043 name=/oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_92twswb9_.dbf
channel ORA_DISK_1: validation complete, elapsed time: 00:06:05

List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
5    OK     90686          2061         4194302         8933769571115
  File Name: /oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_83w4tdy4_.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              4080545
  Index      0              344
  Other      0              111352

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
8    OK     94972          13426        4194302         8933767704590
  File Name: /oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_856br43g_.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              4062075
  Index      0              185
  Other      0              118616

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
42   OK     90372          1            217600          8933767704253
  File Name: /oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_92twl0d0_.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              124896
  Index      0              0
  Other      0              92703

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
43   OK     90384          1937         204800          8933767704232
  File Name: /oracle/EIER/oradata/EIER/datafile/o1_mf_dwh_stag_92twswb9_.dbf
  Block Type Blocks Failing Blocks Processed
  ---------- -------------- ----------------
  Data       0              110416
  Index      0              0
  Other      0              92447

Finished validate at 06-NOV-13

All useful info, but the main thing is that the v$database_block_corruption view is now populated, and we can join it to dba_extents to find what the problem objects are:

SELECT DISTINCT owner, segment_name, segment_type
FROM dba_extents e, v$database_block_corruption x
WHERE e.file_id = x.file#
AND x.block# BETWEEN e.block_id AND e.block_id + e.blocks - 1;

This returns the ‘bad’ objects. It does take quite a while (there must be a more efficient way of identifying the objects). In our case it returned two tables (well, a partition and a subpartition of two different tables, but essentially two tables).

The developers told us that recreating the data was possible, but they needed to use the same IDs as before (every row of the data carried the same identifier), so all we actually needed to retrieve was one ID.

However, as soon as you query the table you get the block corruption error. There were so many bad blocks that flagging them all individually to be ignored would have taken ages, but we found that by setting a session event you could skip the corrupt blocks (I'm sure there is a hint for this too, but I couldn't find it):

alter session set events='10231 trace name context forever';

A select * from the table then worked OK, skipping all the bad blocks, but it still returned 0 rows, as we perhaps should have expected given that the whole table had been loaded in one go with the APPEND hint…
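For the record, there is also a supported, per-object way to get the same skip-corrupt behaviour via DBMS_REPAIR; the schema and table names here are placeholders, not our real objects:

```sql
-- Mark corrupt blocks in one table so that full scans skip them.
-- SCOTT and DWH_TAB are hypothetical placeholders.
BEGIN
  DBMS_REPAIR.SKIP_CORRUPT_BLOCKS(
    schema_name => 'SCOTT',
    object_name => 'DWH_TAB',
    object_type => DBMS_REPAIR.TABLE_OBJECT,
    flags       => DBMS_REPAIR.SKIP_FLAG);
END;
/
```

Unlike the 10231 event, this is scoped to the one object rather than the whole session.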

The developers then decided they could recreate the values from information logged elsewhere in the application, so it looks like all will be OK.

Anyway, this was a useful learning exercise and a reminder to be careful with nologging operations. I've heard it said before that some DBAs just set all their databases to FORCE LOGGING, and I can see that this is sound advice: how much benefit do we really get from nologging for it to outweigh the risks and problems it can create?
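Checking and changing this is trivial:

```sql
-- See whether the database already forces logging (YES/NO)
SELECT force_logging FROM v$database;

-- Force redo generation for all operations, overriding any
-- NOLOGGING attribute or direct-path load
ALTER DATABASE FORCE LOGGING;
```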

-- Addendum --

Another useful thing came out of discussing this with another internal team: they swore that they had tested deleting archivelogs before doing a flashback and that it worked fine. Initially I didn't believe them, but after a bit of digging it seems that what had actually happened was this:

1) switch flashback database off (yes off)

2) create a guaranteed restore point (yes, unbelievably, this works even when flashback is off)

3) do some work, delete archivelogs and then flashback

And amazingly this does work: it seems that when flashback database is switched off, guaranteed restore points (GRPs) can still be created and flashed back to. The blocks are logged differently (hence no need for the archive logs), but you can only flash back to the GRP itself and to no other point.
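Sketched out, the sequence they had tested looks roughly like this (run as SYSDBA; the restore point name is illustrative):

```sql
-- Flashback database is OFF, but a guaranteed restore point still works
ALTER DATABASE FLASHBACK OFF;
CREATE RESTORE POINT before_batch GUARANTEE FLASHBACK DATABASE;

-- ... do some work, even delete archivelogs ...

-- Flashback requires the database to be mounted, not open
SHUTDOWN IMMEDIATE
STARTUP MOUNT
FLASHBACK DATABASE TO RESTORE POINT before_batch;
ALTER DATABASE OPEN RESETLOGS;
```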

You learn something every day…


2 comments:

  1. Hi Rich,

    careful with this sentence:
    "In our case the nologging had been cause by the developers using the +APPEND hint (the one that everyone forgets causes it to be nologging)."
    As I'm extensively testing our DWH's Nologging capabilities I think it's useful to follow this chart if you ever get confused:

    https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:5280714813869

    and we said...

    It is even more deep then that. For example:

    Table Mode  Insert Mode  ArchiveLog mode     result
    ----------  -----------  ------------------  --------------
    LOGGING     APPEND       ARCHIVE LOG         redo generated
    NOLOGGING   APPEND       ARCHIVE LOG         no redo
    LOGGING     no append    ""                  redo generated
    NOLOGGING   no append    ""                  redo generated
    LOGGING     APPEND       noarchive log mode  no redo
    NOLOGGING   APPEND       noarchive log mode  no redo
    LOGGING     no append    noarchive log mode  redo generated
    NOLOGGING   no append    noarchive log mode  redo generated

    Cheers,
    Balazs

  2. Hi Balazs,
    Thanks for that - poor choice of wording by me. I should have said the 'lack of complete redo' was caused by the developer using APPEND rather than the 'nologging' was caused by the APPEND.

    Cheers,
    Rich
