How to investigate random hangs in ODP.NET — oracle-tech

    Forum Stats

  • 3,715,999 Users
  • 2,242,927 Discussions
  • 7,845,731 Comments

Discussions

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

How to investigate random hangs in ODP.NET

lubzeli
lubzeli Member Posts: 4
edited June 2018 in ODP.NET

Hello.

I'm hoping someone more experienced can give me some hints how to investigate a weird issue - someone who faced a similar issue and was able to find the cause. We have multi-threaded application in production environment that processes fixed format files and does some simple validations which are all in C# so on DB side we basically have only CRUD operations. For the last year and a half the current version worked fine but recently the client started to experience hangs. I made a full memory dump of the hanging process and looked at it using windbg and found that some threads got stuck on ODP.NET calls. Here are a few examples:

Thread 1 .NET:
0b95eb38 7c8283ac [NDirectMethodFrameGeneric: 0b95eb38] Oracle.DataAccess.Client.OpsSql.ExecuteNonQuery(IntPtr, IntPtr ByRef, IntPtr ByRef, IntPtr ByRef, IntPtr, Int32 ByRef, Int32, Int32, Int64 ByRef, Oracle.DataAccess.Client.OpoSqlValCtx* ByRef, System.String, IntPtr ByRef, IntPtr[], System.String[], Oracle.DataAccess.Client.OpoMetValCtx* ByRef, Int32)
0b95eb80 0597a630 Oracle.DataAccess.Client.OracleCommand.ExecuteNonQuery()
0b95ecf0 07c88615 CIP.Repositories.Oracle.DomainRepository.DomainExists(System.String, Param[])
0b95ed0c 07c89a4a CIP.Repositories.Oracle.DomainRepository.ExistsCountry(System.String)
0b95ed1c 07c899e0 CIP.Control.Validators.DomainValidator+<>c__DisplayClass16.<ExistsCountry>b__15()

Thread 2 .NET:
0c4cf800 7c8283ac [InlinedCallFrame: 0c4cf800] Oracle.DataAccess.Client.OpsTxn.Commit(IntPtr, IntPtr, Oracle.DataAccess.Client.OpoTxnValCtx*)
0c4cf7fc 066930fd Oracle.DataAccess.Client.OracleTransaction.Commit()
0c4cf860 06692fe3 CIP.Repositories.Oracle.Infrastructure.OracleSession.CommitTransaction()
0c4cf868 07a8093c CIP.Control.Services.ServiceBase.WriteLuIntoRepo(CIP.NT4FilesModel.LogicalUnits.LuBase)
0c4cf8b4 08f2c6c3 CIP.BatchImporter.ProcExecGateways.BatchServices.ExecCntCpt(CIP.BatchImporter.ProcessMessages.CntCptInput)

Thread 3. NET:
0c2cf2e8 7c8283ac [NDirectMethodFrameStandalone: 0c2cf2e8] Oracle.DataAccess.Client.OpsCon.CheckConStatus(IntPtr, IntPtr, Int32, Int32 ByRef, Int32, Int32)
0c2cf308 064f5053 Oracle.DataAccess.Client.ConnectionPool.CheckLifeTimeAndStatus(Oracle.DataAccess.Client.OpoConCtx ByRef, Int32, Boolean ByRef, Int32, Boolean)
0c2cf3a0 064f0718 Oracle.DataAccess.Client.ConnectionPool.PutConnection(Oracle.DataAccess.Client.OpoConCtx ByRef, Boolean, Boolean, Boolean, Int32)
0c2cf3a4 064f4b21 [InlinedCallFrame: 0c2cf3a4]
0c2cf51c 064f47f8 Oracle.DataAccess.Client.OracleConnection.Close()

Thread 4 .NET:
0d42f19c 7c8283ac [InlinedCallFrame: 0d42f19c] Oracle.DataAccess.Client.OpsTxn.Commit(IntPtr, IntPtr, Oracle.DataAccess.Client.OpoTxnValCtx*)
0d42f198 066930fd Oracle.DataAccess.Client.OracleTransaction.Commit()
0d42f1fc 06692fe3 CIP.Repositories.Oracle.Infrastructure.OracleSession.CommitTransaction()
0d42f204 07a85087 CIP.ElabManager.ElaborationStatusManager.WriteElaborationErrors(CIP.ElabManager.ErrorManager, CIP.NT4FilesModel.LogicalUnits.LuBase)
0d42f234 08f2c6dc CIP.BatchImporter.ProcExecGateways.BatchServices.ExecCntCpt(CIP.BatchImporter.ProcessMessages.CntCptInput)

Looking deeper at the full stack and not just .NET stack I found that all threads are basically waiting to receive data which never happens (at least I hope I'm reading it right):

0b958b10 7c827ac9 ntdll!ZwWaitForSingleObject+0xc
0b958b14 71b21af5 mswsock!SockWaitForSingleObject+0x19d, calling ntdll!ZwWaitForSingleObject
0b958b50 71b2c517 mswsock!WSPRecv+0x203, calling mswsock!SockWaitForSingleObject
0b958bc8 71c094e5  ws2_32!WSARecv+0x77

The problem with Thread 1 can be mitigated using CommandTimeout but did not find a way to do the same for the other three so it is really not a solution.

It seems like a network problem to me but what kind on network problem would be responsible for this and didn't cause a error either on application side or database side (not 100% sure about this because I do not have access only customers DBAs do)? And how should I investigate such a complicated issue? It happens rarely maybe once in 100 000 executions of ODP.NET methods that communicate with the database but in time more and more threads get stuck which lowers performance and eventually at the end of processing the application remains stuck waiting for the stuck threads to finish which never happens.

Some basic info about the environment in case this is a bug which is known and I missed it:

Database:

Oracle 10.1.0.5

Some Linux based OS (never really cared about it while everything was working)

Application Server:

Windows 2003 R2

ODP.NET 11.1.0.7.0

.NET Framework 2.0

Yes everything totally supported

Also the application uses connection pooling - Min Pool Size=5;Max Pool Size=120;Connection Timeout=60

I'll appreciate any suggestions.

Thank you.

Tagged:

Answers

  • AndrewSayer
    AndrewSayer Member Posts: 12,998 Gold Crown
    edited June 2018

    This immediately sounds like the application is waiting for the DB to do something for it, it could be any number of things, luckily Oracle is one of the most instrumented systems around so you just have to ask it.

    Does your application identify itself to the DB using module, action and/or client identifier? You could query what the session is currently doing in the DB using v$session, get the DBAs involved - they can query this and should know what they’re doing enough to follow the leads.

    If your application doesn’t identify itself, then you’ve messed up. But, with a bit of luck, if you can identify problem periods then depending on what licensing the client has purchased they should be able to see what was going on in the DB around the problem time and see what is likely to be your application “hanging”.

    You should be able to support your client sending you an AWR report (if licenced) or statspack report (which needs installing but is included in all DB licences as far as I’m aware). If you have someone knowledgable enough in your company (and you really should if you’re going to support it) then get them to give their opinion, otherwise you might have some luck here (or probably the General Database space). Just make sure the report period is covers the problem period and not much more.

    Can you replicate the issue in house? It’s a lot easier when you can eliminate the back and forth between client and vendor like that. Of course, you’d need an exact replica of the tables they’re using (including data types, indexes, constraints) and be able to replicate the usage.

    If you want my initial stab in the dark, they are writing lots of data very fast and their log writer is unable to keep up so is not able to respond to commits fast and will block DML from increasing the redo back log. This will be obvious by the wait events that happen during the problem period. This you won’t be able to replicate unless you can replicate their load reasonably well and have configured the same redo buffer and have the same log writing abilities.

  • lubzeli
    lubzeli Member Posts: 4
    edited June 2018

    Sadly we cannot replicate the problem. It's just like you said we would need to replicate the conditions. We tried to replicate the problem by using a database on low performance machine and letting the application go wild on it but all we achieved was low processing speed with a ~95% utilization of DB resources. No problems or crashes except you couldn't do much on the DB server.

    It is true that the application can create a lot network traffic and hit the database hard with requests and there are a lot of commits but it is possible that things get so bad that you perform a SELECT on a table with like 200 rows which is never modified only queried and you never get a response back? And it happens in like 1 out of 100000 times? And because of that one time the thread from which the request was made gets stuck indefinetly waiting for a result from DB until you kill the application. I would expect a error at least or some trace on database server explaining it. Unless there is some third party involved that can manipulate TCP/IP connections but what could it be to have this effect? Are there some OS limits that could be responsible?

    It's definitly the worst kind of problem to be solving especially without full access to one the environment where it occurs. It's why I posted here to get some ideas where to look. I'm sure I'm not the only one solving such obscure and random problem

    Speaking of ideas the one with v$session is a good one. While I can identify which sessions belong to the application I'm not able to map the ODP.NET connections to Oracle sessions and so I cannot tell which sessions belong to the connections in application where execution got stuck. I think I can figure this part out though.

    Thank you for your input.

This discussion has been closed.