A few weeks ago I was contacted by the ASM product manager as a follow-up to my post on the corruption issues I'd had with ASM in Azure. We talked through those issues, and he told me that Oracle would be investigating further with Microsoft, as it was believed that the problem really shouldn't have happened in the first place (even ignoring my workaround). A couple of weeks ago he gave me an update, and I'm sharing that email below. (Jim, I think it reads better to include most of the email, as it has some extra explanation that people may find interesting. I took out your email address to avoid you getting spammed by everyone :-))
Richard,
The last time we spoke, I indicated that Oracle would be
doing independent testing of the corruption issue you first reported in your
blog. Today, I would like to bring you up to date about what we found and also
request an update to your blog to reflect our understanding of the problem.
The testing was a joint effort between Microsoft and Oracle.
Microsoft's engineering team provided the identification of the critical
bug. The testing began by reproducing the corruption with much of the same
configuration reported by you. We were readily able to reproduce the exact same
corrupted data pattern. Our testing also revealed that certain Linux kernels
experienced this problem while others did not have a problem. Microsoft seemed
to have a pretty good idea of what the problem was and provided Oracle a patch.
With the Microsoft supplied Linux kernel patch we could not
recreate the corruption regardless of load placed on the database. The patch
modifies memory block management handling in the kernel running in the VM
associated with doing IO. The problem scenario is that when an Oracle database
utilizing ASM runs in an Azure VM, Microsoft's paravirtualization driver
(storvsc), running in the VM, interacts with the kernel IO buffer logic to
cause wrong data to be written by the database log-writer. The nature of the
corruption is unpredictable and happens infrequently, however under heavy load,
without the patch the corruption is easily reproduced.
The nature of the bug is that errors are not reported at the
moment of the corruption. It is only later, when the database's archive process
reads the redo logs, that the corruption is detected and reported. It is not
known whether silent data corruption is also occurring in other files and
simply going unreported.
There are a number of circumventions reported by you and
others, including not using ASM for database storage, treating the ASM disks as
4K sector devices, avoiding the 3.10.0-514 Redhat kernel, and using ASMLIB
for device management. At this point, if at all possible, Oracle recommends customers
simply avoid this particular kernel in an Azure environment. They could use
Oracle's UEK kernel (Oracle Linux Azure VM) or an older Redhat kernel. The
other workarounds of treating ASM disks as 4K sector disks and/or using ASMLIB,
while likely effective, do involve additional management efforts, and we're not
completely certain that the issue is entirely avoided.
With respect to your blog, I request that it be updated to
include the following points:
· LUNs in Azure are presented as 512e devices. That means that internally
they are structured as 4K "sector" disks (physical), but emulate 512-byte
sector disks (logical) from an application perspective. I put "sector" in
quotes because this is not really true of SSD disks, but the Advanced Format
Disk (512e) specification was written with conventional rotating disks in
mind. ASM and the Oracle database work correctly with Advanced Format disks
in 512-byte emulation mode, and it is not necessary to create 4K disk groups
for correct operation. Some flash storage vendors recommend doing so, but
strictly for performance reasons associated with their particular products.
There is no need for ASM to detect sector size in this respect.
· The bug discussed here is with particular Redhat kernels. This bug is
exposed in an Azure virtualization environment with ASM. The kernel we know
to be problematic is 3.10.0-514. There may be other kernels having the
problem, but we could not reproduce the issue with Redhat kernel 3.10.0-327
or Oracle’s current UEK kernel.
· We do not know if creating 4K sector disk groups is a complete fix. At
best it circumvents the bug. Our testing seems to verify it as a reliable
workaround, but there may be situations where data is still silently
corrupted.
Thank you for reporting this issue. Oracle takes the issue
of data corruption as one of our most important concerns. If there is other
information I can provide or if you would like to discuss this by phone, please
let me know.
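The following is not part of the email above. It is a minimal sketch of how a VM could be checked against the two facts the email states: the 3.10.0-514 kernel series being the known problem kernel, and Azure LUNs being 512e devices (512-byte logical, 4K physical). The function names are hypothetical and chosen for illustration. It assumes the kernel string has the usual `uname -r` shape (for example `3.10.0-514.el7.x86_64`) and treats every build in the 3.10.0-514 series as suspect, which is a cautious assumption, since the email does not say which builds carry the fix. The block sizes are read from the standard Linux sysfs files `queue/logical_block_size` and `queue/physical_block_size`; whether a given Azure disk reports 512/4096 there has not been verified here.

```python
import os
import platform

# Kernel series the email names as problematic for ASM on Azure.
AFFECTED_KERNEL_PREFIX = "3.10.0-514"


def is_affected_kernel(release):
    """Return True if a `uname -r` style string is in the 3.10.0-514 series."""
    if not release.startswith(AFFECTED_KERNEL_PREFIX):
        return False
    # Require end of string or a separator after the prefix, so that a
    # different build number such as "3.10.0-5140" does not match.
    rest = release[len(AFFECTED_KERNEL_PREFIX):]
    return rest == "" or rest[0] in ".-"


def classify_sector_format(logical, physical):
    """Name the disk format from logical/physical block sizes in bytes."""
    if logical == 512 and physical == 512:
        return "512n"  # native 512-byte sectors
    if logical == 512 and physical == 4096:
        return "512e"  # 4K physical, 512-byte emulation
    if logical == 4096 and physical == 4096:
        return "4Kn"   # native 4K sectors
    return "unknown"


def read_block_sizes(device, sysfs_root="/sys/block"):
    """Read (logical, physical) block sizes for a device name such as 'sdc'."""
    queue = os.path.join(sysfs_root, device, "queue")
    with open(os.path.join(queue, "logical_block_size")) as f:
        logical = int(f.read().strip())
    with open(os.path.join(queue, "physical_block_size")) as f:
        physical = int(f.read().strip())
    return logical, physical


if __name__ == "__main__":
    release = platform.release()
    status = "in" if is_affected_kernel(release) else "not in"
    print(f"kernel {release} is {status} the {AFFECTED_KERNEL_PREFIX} series")

    # Only Linux exposes /sys/block; skip quietly elsewhere.
    if os.path.isdir("/sys/block"):
        for device in sorted(os.listdir("/sys/block")):
            try:
                logical, physical = read_block_sizes(device)
            except (OSError, ValueError):
                continue  # device without readable queue attributes
            kind = classify_sector_format(logical, physical)
            print(f"{device}: logical={logical} physical={physical} ({kind})")
```

A report of `512e` for the ASM disks matches what the email describes and, per the email, does not by itself call for 4K disk groups. A kernel reported as being in the 3.10.0-514 series is the case Oracle recommends avoiding on Azure.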
https://access.redhat.com/solutions/3114361
Thanks for the link to the Red Hat note on this.