The triangle of doom

Occasionally when you look at the performance screen in cloud control you get a surprise - but i wonder how many of you have seen a screen like this.....

Pretty daunting...

Clicking on the concurrency link to the right shows this

Which is only slightly less alarming :-)

The strangest thing when i first looked at this was the fact that all of the usernames were showing as SYS and the connection was clearly from our application server from the program name and we definitely are not using SYS in the configuration - very strange.

SO i logged on to sqlplus to look at the basic info - which showed me this from v$session (and lots of rows all with the same info)

USERNAME PROGRAM SQL_ID
------------------------------ ------------------------------------------------ -------------
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne
com.mchange.v2.async.ThreadPoolAsynchronousRunne

So a null username ( which i guess causes the SYS display in cloud control) and no sql_id running.

Very odd

Lets have a look at the listener log

And we see lots of what look like successful messages - at least the listener thinks its done its job

18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56362)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56363)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56365)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56364)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56366)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56367)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56368)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56369)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56370)) * establish * DB * 0
18-JUN-2015 14:54:16 * (CONNECT_DATA=(SID=DB)(SERVER=DEDICATED)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=APPSERVER$))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56371)) * establish * DB * 0

I also then checked the alert log and saw messages like this repeated many times

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for Linux: Version 12.1.0.1.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 12.1.0.1.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 12.1.0.1.0 - Production
Time: 18-JUN-2015 14:56:46
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=10.10.10.10)(PORT=56097))
WARNING: inbound connection timed out (ORA-3136)

So it looks like the client is not providing the connection info or time or there is just some network timeout or the process is just taking too long.

From the application server they report that the connection is not working so i logged on to that to see if there were any clues there

Firing up sqlplus i try this

I get a long pause and then the error you see above

However and most bizarrely of all if i choose any user apart from the application login it connects instantly!

stranger and stranger.....

So that rules out pretty much everything other than a problem with that account.....

Then it emerged that the password had been changed earlier that day

So i thought OK i'll try changing it again just to see what happens - and the command just hung - in the end i gave up and killed it.

OK - lets try a complete stop of the app server and a database restart.

So we do that and then sqlplus connections work fine and all appears OK - then we start the appserver and everything breaks again!

So now i'm thinking the appserver must have the wrong password and is in some kind of feedback loop and is swamping the database with so many requests it loses the plot - there is probably an appserver log somewhere full of errors but i don't know where to look for that.

Anyway i track down the password config screen and change the password (with the app completely stopped). Then i bring up the app again and......

It all works fine and the triangle of doom has vanished.

I'm not sure exactly why oracle behaves that way - i don't know if multiple requests have just overwhelmed it (although connections as other users worked fine) or if essentially the user$ row is locked out in some way by the volume of invalid connection requests.

Anyway an interesting one....

Comments

john19 June 2015 at 01:11
You could be hitting a bug
Bug 19867671 : LIBRARY CACHE LOCK CAUSED BY WRONG PASSWORD LOGIN
referenced in
High "Library Cache Locks" Wait Time Due to Invalid Login Attempts (Doc ID 1309738.1)
DBAHarrison19 June 2015 at 01:17
Hi John,
The symptoms are exactly the same but supposedly all this(and the related ones on the same metalink page) are all fixed in12c....

Maybe something is not quite right in 12.1.0.1 - anyway as long as we are careful to change the password properly then its a non issue.

Cheers,

Rich
DBAHarrison19 June 2015 at 01:41
No DB is 12.1.0.1 - but i'm missing something - can;t see any mention of 12.2 fix - there seems to be an enhancement request based on this bug (mentioned in the text of it) - but it doesnt mention a traget version

https://support.oracle.com/epmos/faces/BugDisplay?id=19867671

The other ones i could find all seem to have been supposedly fixed.

Maybe it's just my eyes though.... :-)

DBAHarrison19 June 2015 at 01:45
Ah! - it is my eyes :-)

just found it - somehow the top right of the display was being filtered out.......

Fixed in Product Version 12.2

Thanks

#cloud blog

Search This Blog

The triangle of doom

Comments

Post a Comment