When FRA maths goes wrong



Stop press - I'm now on twitter thanks to peer group pressure (Tim and Chris you know who you are) so I'll hopefully be tweeting all my blog entries from now on (as long as i remember to do it...)

https://twitter.com/dbaharrison

Anyway - back to the blog

It's been a while since my last post (just been too busy with work to write anything up) but the issue we've had over the past couple of days i think warrants making the extra effort.

The post concerns the FRA (fast/flash recovery area - depending on your preference) on one of our 12.1.0.2 databases - specifically the cloud control repository - though the application is really irrelevant in this case.

For many years now we've used the FRA to store all of our archivelogs (regardless of whether flashback is actually enabled or not) - it's just cleaner that way and less maintenance overhead. We had a few teething issues initially (mainly understanding exactly how the mechanics of it worked) but for a long time now we've had no issues - that was until yesterday.

Where we saw this

Errors in file /oracle/admin/DB/diag/rdbms/DB/DB/trace/DB_arc1_29062.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 19327352000 bytes is 46.97% used, and has 10248593920 remaining bytes available.
************************************************************************
You have following choices to free up space from recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
   then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN
   BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
   reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
   system command was used to delete files, then use RMAN CROSSCHECK and
   DELETE EXPIRED commands.
************************************************************************
Errors in file /oracle/admin/DB/diag/rdbms/DB/DB/trace/DB_arc1_29062.trc:
ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim 393922560 bytes disk space from 19327352000 limit

Initially we thought this was just some mismatch issue - the disk is clearly full as the system cannot create an archive log but the FRA is only 46.97% full.

This can generally happen in 2 ways:

1. The FRA is sized bigger than the actual physical disk - so oracle thinks more space is available than there actually is
2. Files that are non database files (or db files that the db does not know about) are taking up space in the FRA - so oracle thinks there is more space available than there actually is

In our case neither of these was true - in fact v$flash_recovery_area_usage clearly showed that it thought the area was indeed full. This did not match with the error in the alert log from the arc process that it thought that the area was only 47% full. As it was only 47% full it hadn't triggered the auto delete of old backed up files - and even the out of space condition also hadn't triggered it.

So what was going on?

Well.... after a lot of false leads - including removing the dataguard attached to this db and sizing the FRA up and down in a vain attempt to make oracle 'wake up' and realize what was going on we finally decided this looks like a bug...

So we crawled through MOS (and google) and while there is often reference to this error - all the causes we had were ruled out or didn't apply in 12.1.0.2.

So what next - well as i know that a lot of this information is in the controlfile and that is likely where arc is checking against for the internal check we decided to shutdown the db and recreate the controlfiles.

so we ran that:

Completed: CREATE CONTROLFILE REUSE DATABASE "DB" NORESETLOGS FORCE LOGGING ARCHIVELOG
    MAXLOGFILES 16
    MAXLOGMEMBERS 2
    MAXDATAFILES 30
    MAXINSTANCES 1
    MAXLOGHISTORY 1168
LOGFILE
  GROUP 1 (
    '/oracle/DB/oradata/DB/onlinelog/o1_mf_1_81q5cfpz_.log',
    '/oracle/DB/recovery_area/DB/onlinelog/o1_mf_1_81q5cltn_.log'
  ) SIZE 400M BLOCKSIZE 512,
  GROUP 2 (
    '/oracle/DB/oradata/DB/onlinelog/o1_mf_2_81q5bdfb_.log',
    '/oracle/DB/recovery_area/DB/onlinelog/o1_mf_2_81q5bkgr_.log'
  ) SIZE 400M BLOCKSIZE 512,
  GROUP 3 (
    '/oracle/DB/oradata/DB/onlinelog/o1_mf_3_81q59rnp_.log',
    '/oracle/DB/recovery_area/DB/onlinelog/o1_mf_3_81q59xwl_.log'
  ) SIZE 400M BLOCKSIZE 512
-- STANDBY LOGFILE
--   GROUP 4 '/oracle/DB/oradata/DB/onlinelog/standby_redo01.log'  SIZE 400M BLOCKSIZE 512,
--   GROUP 5 '/oracle/DB/oradata/DB/onlinelog/standby_redo02.log'  SIZE 400M BLOCKSIZE 512,
--   GROUP 6 '/oracle/DB/oradata/DB/onlinelog/standby_redo03.log'  SIZE 400M BLOCKSIZE 512,
--   GROUP 7 '/oracle/DB/oradata/DB/onlinelog/standby_redo04.log'  SIZE 400M BLOCKSIZE 512
DATAFILE
  '/oracle/DB/oradata/DB/datafile/o1_mf_system_76gbqhq1_.dbf',
  '/oracle/DB/oradata/DB/datafile/o1_mf_sysaux_76gbqkxk_.dbf',
  '/oracle/DB/oradata/DB/datafile/o1_mf_sys_undo_76gbqnxl_.dbf',
  '/oracle/DB/oradata/DB/datafile/mgmt_ecm_depot1.dbf',
  '/oracle/DB/oradata/DB/datafile/mgmt.dbf',
  '/oracle/DB/oradata/DB/datafile/mgmt_ad4j.dbf',
  '/oracle/DB/oradata/DB/datafile/o1_mf_eetpptdb_76gv4vnn_.dbf',
  '/oracle/DB/oradata/DB/datafile/o1_mf_xdb_76xz4r6w_.dbf',
  '/oracle/DB/oradata/DB/datafile/o1_mf_apex_76y18r5m_.dbf',
  '/oracle/DB/oradat
2015-07-16 10:46:59.202000 +01:00
ALTER DATABASE RECOVER  DATABASE
Media Recovery Start
 Started logmerger process
Parallel Media Recovery started with 8 slaves
Recovery of Online Redo Log: Thread 1 Group 1 Seq 30198 Reading mem 0
  Mem# 0: /oracle/DB/oradata/DB/onlinelog/o1_mf_1_81q5cfpz_.log
  Mem# 1: /oracle/DB/recovery_area/DB/onlinelog/o1_mf_1_81q5cltn_.log
2015-07-16 10:47:06.158000 +01:00
Media Recovery Complete (DB)
Completed: ALTER DATABASE RECOVER  DATABASE

*** Note here - need to see what is going on with redo logs here - there seems to be the syntax "BLOCKSIZE 512" added - is that a new 12c thing? ****

So we brought everything back up - now the FRA views show no archive logs (as the area has been wiped essentially)

so we run

catalog recovery area;

This rediscovers whats in the FRA and updates the db/controlfile.

So those figures look correct (but they always did) - but what about the background process that's deleting things?

We shrink the FRA down to trigger the process (i think this threshold is 85% full but we shrunk it to a size that would make it 90% full)

ALTER SYSTEM SET db_recovery_file_dest_size=16G SCOPE=BOTH;

and.....

it works! it realizes it's 90% full and triggers the deletion of old backed up logs

case solved!

No idea what caused this weird corruption - but anyway  we have a fix now - hopefully its just a one off but if you do get it at least you have a fix now.....






2 comments:

  1. *** Note here - need to see what is going on with redo logs here - there seems to be the syntax "BLOCKSIZE 512" added - is that a new 12c thing? ****

    Nope it is not at least 11.2.0.4 has this already but I remember seeing this since quite a while.

    ReplyDelete
  2. Hi Balazs,
    Indeed you are correct - from this doc

    http://docs.oracle.com/cd/E18283_01/server.112/e17120/onlineredo002.htm

    Beginning with Oracle Database 11g Release 2, you can specify the block size of online redo log files with the BLOCKSIZE keyword in the CREATE DATABASE, ALTER DATABASE, and CREATE CONTROLFILE statements. The permissible block sizes are 512, 1024, and 4096.

    Has anyone actually done this though....?

    Cheers,
    Rich

    ReplyDelete