Service crash (multiple threads)

Arjan

Member²
We are using a Windows Service Application with two threads each with its own TDataModule and its own TOracleSession. Each thread tests the connection with Oracle every minute (ini-file setting) by using CheckConnection(True).

If the connection is lost or not available then the threads wait for 10 minutes and try to connect again using again CheckConnection(True).

The following situation occurs:
During the backup (every weekend) of the database both threads fail to make a connection and return the value ccError.

After 10 minutes the threads try to make a connection again. One of the threads fails to make a connection, the return value of the CheckConnection is ccError.

The second thread (also using CheckConnection, same function, same base class) does not return a value for this function and appears to be hanging (pretty sure after 2 days waiting).

After 10 minutes the first thread tries to make a connection again, this thread now also fails to make a connection and also appears to be hanging.

We are using Delphi 2005 and Oracle 9 on a Windows 2000 Server. The property ThreadSafe of the TOracleSession is set to True (True or False does not seem to make a difference).

If anyone has any ideas how to solve this problem please let us know. We are now restarting the service after every weekend.

Thanks
 
We have followed your advice and added the procedure. We still have the same problem, but now we have received an AV in one of the Oracle dll's.
Below the logging of the failure.

Everything is still ok and query is being executed
14-02-2007 11:06:39 - TOracleQuery Start Query.Execute
14-02-2007 11:06:39 - SQL = select 'x' from dual
14-02-2007 11:06:39 - Result =
14-02-2007 11:06:39 - TOracleQuery End (1 record processed)
14-02-2007 11:06:39 - SQL =
14-02-2007 11:06:39 - Result =
14-02-2007 11:06:39 - Duration = 0,01

This is tread number 1
Below shows the first time the database is not available
14-02-2007 11:07:09 - TOracleQuery Start Query.Execute
14-02-2007 11:07:09 - SQL = select 'x' from dual
14-02-2007 11:07:09 - Result =
14-02-2007 11:07:09 - TOracleQuery End
14-02-2007 11:07:09 - SQL =
14-02-2007 11:07:09 - Result = ORA-12571: TNS:packet writer failure
14-02-2007 11:07:09 - Duration = 0,02
14-02-2007 11:07:09 - TOracleSession DmDOAConnection.sesDOACodac Start Session.LogOff
14-02-2007 11:07:09 - SQL =
14-02-2007 11:07:09 - Result =
14-02-2007 11:07:09 - TOracleSession DmDOAConnection.sesDOACodac End
14-02-2007 11:07:09 - SQL =
14-02-2007 11:07:09 - Result =
14-02-2007 11:07:09 - Duration = 0
14-02-2007 11:07:09 - TOracleSession DmDOAConnection.sesDOACodac Start Session.LogOn as @TRIA
14-02-2007 11:07:09 - SQL =
14-02-2007 11:07:09 - Result =
14-02-2007 11:07:09 - TOracleSession DmDOAConnection.sesDOACodac End
14-02-2007 11:07:09 - SQL =
14-02-2007 11:07:09 - Result = ORA-12500: TNS:listener failed to start a dedicated server process
14-02-2007 11:07:09 - Duration = 0,06

This is thread number 2
Below the first tim the database is not available
14-02-2007 11:07:09 - TOracleQuery Start Query.Execute
14-02-2007 11:07:09 - SQL = select 'x' from dual
14-02-2007 11:07:09 - Result =
14-02-2007 11:07:09 - TOracleQuery End
14-02-2007 11:07:09 - SQL =
14-02-2007 11:07:09 - Result = ORA-12571: TNS:packet writer failure
14-02-2007 11:07:09 - Duration = 0
14-02-2007 11:07:09 - TOracleSession DmDOAConnection_1.sesDOACodac Start Session.LogOff
14-02-2007 11:07:09 - SQL =
14-02-2007 11:07:09 - Result =
14-02-2007 11:07:10 - TOracleSession DmDOAConnection_1.sesDOACodac End
14-02-2007 11:07:10 - SQL =
14-02-2007 11:07:10 - Result =
14-02-2007 11:07:10 - Duration = 0,01
14-02-2007 11:07:10 - TOracleSession DmDOAConnection_1.sesDOACodac Start Session.LogOn as @TRIA
14-02-2007 11:07:10 - SQL =
14-02-2007 11:07:10 - Result =
14-02-2007 11:07:10 - TOracleSession DmDOAConnection_1.sesDOACodac End
14-02-2007 11:07:10 - SQL =
14-02-2007 11:07:10 - Result = ORA-12500: TNS:listener failed to start a dedicated server process
14-02-2007 11:07:10 - Duration = 0,05

This is thread 1 again
We are trying to reconnect, the database could be up and running again
14-02-2007 11:07:39 - TOracleSession DmDOAConnection.sesDOACodac Start Session.LogOn as @TRIA
14-02-2007 11:07:39 - SQL =
14-02-2007 11:07:39 - Result =
14-02-2007 11:07:39 - TOracleSession DmDOAConnection.sesDOACodac End
14-02-2007 11:07:39 - SQL =
THIS IS OUR PROBLEM :mad:
14-02-2007 11:07:39 - Result = Access violation at address 61915724 in module 'orantcp9.dll'. Read of address 00000000
14-02-2007 11:07:39 - Duration = 0,01

This is thread 2 again
The connection cannot be made, probably because de DLL is still being used by another process (or something)
14-02-2007 11:07:40 - TOracleSession DmDOAConnection_1.sesDOACodac Start Session.LogOn as @TRIA
14-02-2007 11:07:40 - SQL =
14-02-2007 11:07:40 - Result =

And again tread 1
Now we have got the same error for thread 1
14-02-2007 11:08:09 - TOracleSession DmDOAConnection.sesDOACodac Start Session.LogOn as @TRIA
14-02-2007 11:08:09 - SQL =
14-02-2007 11:08:09 - Result =
 
Arjan, instead of calling CheckConnection maybe you can try to execute a noop; (begin null; end;) and if fails recreate the session object.

Maybe it works better this way.
( If something is not working just take a bigger hammer )
 
It is difficult to find the cause of such an access violation in an internal Oracle DLL. You could try to recreate the session as suggested, or even free all DOA components, call OracleCI.DLLExit (which unloads the Oracle Call Interface Library), and recreate the DOA components. The Oracle Call Interface Library will automatically be reloaded during the first logon.
 
We could free everything if we got the AV, but we don't. This is because CheckConnection catches all exceptions.
So we do not know when we have to create a new Datamodule. We only see in our logging that there is nothing happening.
So we probably do not have to or can not use CheckConnection, but create our own function to determine the connection.
Even so it is still a bit strange that after the LogOn we got an AV. Of course we still have the second thread, which does not know the first thread had crashed. I hope that recreating the session object from the first thread and freeing the OCI library is enough to reconnect both threads.

We will take a bigger hammer and try to make this work.
 
Arjan,
I had once a problem with DOA when an oracle session drops connection. The only explanation I had was that DOA is trying to recover from error and release some objects (correct behavior), but because from oracle view point there is nothing to recover the program crash with AV.

I was never able to confirm this problem but based on my observations I realized that in this case a new session may solve this kind of problem because the previous session can still use some "handels" which are already dropped by oracle. This means that from my view point the problem is more in oracle (OCI) because if something was dropped it should throw errors and not access violations.

Anyway you have to pay attention on memory allocations.
 
I agree!

So we have changed and rebuild our source code. Now we have to wait for the backup to start, so tomorrow we will know if this is the solution or work around (whatever you like).
 
Very similar errors (AV inside OCI) I had, when executed TOracleSession.LogOn() (and LogOff too) at the same time inside many separate threads.

My solution was to create mutex which prohibits more than one logon/logoff at the same time.

Bertrand
 
We have had the same problem. After some time we received AV (multiple) when opening and closing the connection to check if the DB is still available.
This is the reason why we are using CheckConnection, this solved our problem.

About our new problem, it seems that unloading the dll helps. We are now preparing the test for the weekend (major backup). We will keep you informed.
 
Well, it did not work. We have tried to unload the dll. That seems to work fine, but afterwards both threads again cannot connect with the database.

We have also tried a service with one thread, this is working just fine. The thread is still running and there are no AV's. So the problem is most likely caused by the two threads.

Is it possible that one thread is blocking the other one? Could this be caused by the DOA components? Or should we search in the Oracle dll for the problem?
 
Threads do not block each other, unless they are using components that are linked to the same TOracleSession instance. Only TOracleSession.LogOn/LogOff will block the same call from another TOracleSession instance.
 
We have now created a mutex before making connection with the database.

If the LogOn/LogOff is causing the blocks, this might indeed do what is needed (already said by Blogin, thanks!)
 
The mutex did not work.... well, it works just fine, but it does not fix our problem.

After we have had a crash the DDL is released. Afterwards we are trying to make a connection.
The first thread is making a connection using the mutex but the connection is never made, therefor the mutex is never released.
So when the second thread tries to make a connection it cannot enter the mutex and never makes a connection because it recieves a time-out given by the mutex.

Any ideas anyone?
 
Back
Top