Oracle has AWR now, but it took them a long time to get to AWR. TimesTen does not have anything equivalent to AWR and won't have one for some time yet. To monitor and tune TimesTen you use a combination of the ttIsql 'monitor' command (or querying sys.monitor from application code), the output of various built-in procedures, viewing query plans, O/S level metrics for CPU utilisation, I/O etc. and instrumentation in the application (you do have instrumentation in your application, right?). There is no one thing that you run for a while and that then nicely tells you where your problems may be. Some of this information is also available via the EM plugin.
You have to capture the various available metrics and then perform an iterative analysis and tuning process.
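For example (a minimal sketch; 'mydsn' is a placeholder DSN name), the basic monitoring workflow from ttIsql looks like this:

ttIsql mydsn
Command> monitor;
Command> select * from sys.monitor;

The 'monitor' command and the sys.monitor table expose largely the same counters; it is sampling them at intervals and comparing successive values that turns them into something useful.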
Chris, thank you for your detailed response. Please find my answers to your questions below.
1. Full CPU utilisation is not necessarily an issue. With a traditional client/server RDBMS such as Oracle, where the application and DB are on different machines, the application is often waiting for a response from the database. Hence (a) the CPU utilisation on the application machine tends to be lower (lots of waiting, which does not use CPU) and (b) a relatively large number of concurrent connections (app -> DB) are needed to get good overall throughput. If you are using direct mode, neither of these apply to TimesTen; database and application processing occur on the same machine, resulting in higher CPU consumption, there is no 'waiting' so again CPU usage is higher, and lastly you do not want to have lots of concurrent connections as this just introduces contention both within the database (locks and latches) and also at the OS level (scheduling). (A short sketch of the direct versus client/server setup follows the response below.)
[Response] Yes, I agree with you that direct mode uses the application machine's CPU, and this is basically a good feature which increases performance. But it continuously uses all CPUs at 100% (checked with the top command). I have 2 * quad-core CPUs. When we start the load, it works fine (50-60% CPU utilization). But after 10-15 mins, CPU usage shoots up very badly. All 8 cores get 100% utilized and because of that my application throughput comes down.
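To make the direct versus client/server distinction in point 1 concrete, here is a rough sketch (the DSN names are placeholders):

Direct mode - the application (or ttIsql) loads the direct driver and opens the local DSN:
ttIsql mydsn
Client/server - the application uses the client driver and a client DSN whose TTC_SERVER / TTC_SERVER_DSN attributes point at the TimesTen server:
ttIsqlCS myclientdsn

In direct mode the database code executes inside the application's own process, which is why the application machine's CPU ends up doing all of the database work as well.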
2. Have you verified the query plans for all SQL statements being executed by the application? A bad query plan can result in low performance and high CPU usage.
[Response] We have checked the plan for every query. Each query (select, update) goes through an index scan. The insert also happens within TimesTen, not via passthrough.
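For anyone following along, this is the sort of plan check being referred to (a sketch; the table and query are hypothetical):

Command> showplan 1;
Command> select * from subscriber where sub_id = 1234;
Command> showplan 0;

With showplan enabled, ttIsql prints the optimizer plan for each statement it executes, so full table scans and missing indexes are easy to spot.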
3. I see PASSTHROUGH_COUNT: 338. PassThrough is a great convenience feature but it is not intended for high performance and is quite 'expensive'. If your application makes a lot of queries that need to run against Oracle it is better (i.e. cheaper and faster) to have a separate connection direct to Oracle for those.
[Response] I have a few tables which are not cached in TimesTen but need to be accessed when my application starts. But these tables are not accessed during my load test. This count does not increase during my load test (checked in monitor).
4. These stats show LOG_BYTES_TO_LOG_BUFFER: 5502728. Either this workload is 99.9% read or these stats represent a very short amount of time. For stats to be useful you need to run a realistic workload and then sample the 'monitor' stats a few times at 60 second intervals while the workload is running (a simple sampling loop is sketched after the response below). Any kind of meaningful performance test needs to run for many minutes (really 10s of minutes). You can't really say anything useful from a test that runs for just a minute or two (or less).
[Response] The application works perfectly fine for the initial 10 mins; it goes down after 10-15 mins. I am testing the application at 2000 TPS (5 concurrent requests per millisecond).
During load, each connection has 3 select queries, 2 update queries and 1 insert query.
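As an aside, a simple way to do the 60-second sampling mentioned in point 4 is a small shell loop (a sketch; 'mydsn' is a placeholder and the iteration count is arbitrary):

for i in 1 2 3 4 5
do
  ttIsql -e "monitor; quit;" mydsn >> monitor_samples.txt
  sleep 60
done

Diffing successive samples of counters such as LOG_BYTES_TO_LOG_BUFFER then gives per-interval rates rather than cumulative totals.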
5. How many cores does this machine have? Any hyper-threading? How many concurrent connections are there to TimesTen and how many of those are actually active (i.e. doing database work)?
[Response] 8 cores. 400 JDBC connections from the application to TimesTen, but only 5 concurrently active connections.
Okay, I think that the statement that 'after 10-15 minutes the CPU usage increases dramatically' is very relevant. The default checkpoint interval for TimesTen is 10 minutes. When there is significant write activity in TimesTen then a checkpoint can be both CPU and I/O intensive. You should also note that there are other activities in TimesTen that can be CPU intensive:
- Application connections; when there are no wait events (durable commits, RETURN RECEIPT/TWOSAFE replication) an application connection can easily consume 100% of one core if there is work to do.
- If READONLY caching or cache grid is being used, the cache agent can be CPU intensive, especially during large autorefresh operations.
- If replication or AWT caching is being used, the replication agent can be CPU intensive. This is especially true if parallel replication or parallel AWT is used.
- The log flusher thread in the sub-daemon can be very CPU intensive (as well as very I/O intensive) under heavy write workloads.
- Checkpointing can be very CPU and I/O intensive when there is a significant write workload.
So, you need to (a) ensure that you have enough CPU power/cores to handle the workload and (b) ensure you have enough I/O capability to support the checkpointing and logging. Note that if your log and checkpoint files are on the same storage area then this may really hurt you (unless it is a high bandwidth RAID-0 stripe across a number of fast disks) depending on the database write workload.
I would suggest that you:
(a) try to separate log and checkpoint files onto different physical storage
(b) during a test run, use the ttCkptHistory built-in (from a ttIsql connection) to see if you can correlate checkpoint activity with the CPU usage peaks (see the example after this list)
(c) try a test run with checkpointing disabled (CkptFrequency=0, CkptLogVolume=0) to see if this changes anything (but beware of running out of space in the log filesystem since logs won't be purged without checkpointing)
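For (b), the check is just a built-in call from a ttIsql session (sketch):

Command> call ttCkptHistory;

Each row describes a recent checkpoint (type, start and end times, completion status), so you can line the timestamps up against the periods of high CPU usage.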
If it seems that checkpointing is the culprit then it is likely you are short on either CPU horsepower or I/O bandwidth or both. Try separating logs and checkpoints as previously noted. Try running fewer concurrent active application connections/threads/processes. Try 'throttling' checkpoints using CkptRate or ttCkptConfig.
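For illustration, checkpoint throttling can be set either as first-connection attributes in sys.odbc.ini or at runtime with the ttCkptConfig built-in. A sketch with purely illustrative values (please verify the ttCkptConfig argument order against the documentation for your release before using it):

In sys.odbc.ini:
CkptFrequency=600
CkptLogVolume=0
CkptRate=10

Or from ttIsql:
Command> call ttCkptConfig(600, 0, 10);

As I recall CkptRate is expressed in MB per second, with 0 meaning unlimited.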
Okay, it doesn't look like checkpointing is the issue. We need to dig a little deeper. Can you please provide the following information? There is a fair bit of stuff so if you prefer to provide an e-mail address then I can contact you directly.
1. Output of the ttVersion command.
2. ttStatus taken when the system is under load but CPU usage is not excessive.
3. TimesTen latchstats captured when the system is under load but CPU usage is not excessive:
ttXactAdmin -latchstats off DSN
ttXactAdmin -latchstats clear DSN
ttXactAdmin -latchstats on DSN
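-- let the workload run for a minute or two here before turning latchstats off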
ttXactAdmin -latchstats off DSN
ttXactAdmin -latchstats show DSN -- capture output from this one
4. ttStatus and latchstats (as per 2 and 3) when the system is under load and CPU usage is at 100%
5. ttSchema output for your datastore
6. Current DSN settings from sys.odbc.ini
7. Some details on the application: is it C, C++ or Java? What API(s) does it use? Does it use prepared and parameterised statements? What does a 'transaction' look like in SQL terms? Etc.
Thank you for your guidance over mail.
We found a memory leak in our application, because of which results were not coming back.
We are fixing the code, and will start the load test again.
I will post comments here again if I face any new issues.
Thank you very much for the help.
We have resolved the application issue. The load is not hanging now.
Yesterday I ran it for 4 hours, and it ran successfully.
But the problem now is with performance.
During load, I am getting a lot of spikes in query time and commit time.
Below are 2 scenarios I have tested, for a total of 3,000,000 requests at 500 TPS:
1. Replication agent stopped
- Performance is much, much better, almost 95% performance achieved (each request below 20 milliseconds)
- 5% is because of spikes in query execution time or transaction commit time (I have printed the times in my application log)
- Passthrough count does not increase, so there are no queries during load which get passed through
2. Replication agent started
- Very poor performance, 65% performance achieved
- vmstat 2 (heavy swap-in and swap-out, I am not sure if this could be the concern)
- Poor query execution time and commit time
- Passthrough count continuously increases, so I assume this has to be because of the replication agent
So in conclusion, when I keep the replication agent on, my performance decreases drastically.
Please guide me on what needs to be tuned.
- vmstat 2 (heavy swap-in and swap-out, I am not sure if this could be the concern)
This is definitely a major concern!
How much physical RAM does this machine have? How much is allocated to the JVM? What do O/S tools such as 'ps' and 'top' show for the memory usage of the repagent while the workload is running?
As already noted, passthrough arising from the replication agent is not related to application performance and should be ignored.
Please post the current sys.odbc.ini settings (in case they have changed since the previous post).
Also, are you using TT to TT replication, AWT cache groups or both?
When running the workload with the replication agent active can you please capture the output of the ttIsql 'monitor' command several times at 1 minute intervals and post it. Also please post the output from vmstat 5 5 for several minutes.
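Something along these lines will do for the capture (a sketch; 'mydsn' is a placeholder and the counts/intervals are only suggestions):

vmstat 5 5 > vmstat_under_load.txt
for i in 1 2 3 4 5
do
  ttIsql -e "monitor; quit;" mydsn >> monitor_under_load.txt
  sleep 60
done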
Okay, there are two causes for concern here.
There is significant, continual paging going on. This is very likely impacting things and could be a cause of elongated response times. You need to investigate what is causing the overall memory usage to be so large that the system has to continually page. You can e.g. use 'top' and sort on memory size (RES) and see what is increasing over time. Or you can use something like ps -eo pmem,pcpu,args | sort -k 1 -r | less.
We generally recommend that you ensure that the memory allocated to the TimesTen datastore is permanently locked (wired) into physical memory so it cannot be paged. To do this you need to add MemoryLock=4 to the DSN settings in sys.odbc.ini. You also need to configure the O/S to allow enough locked memory and to allow the TimesTen instance admin user to use memory locking (consult the TimesTen Installation and Operations guides for more details). Note that while it is strongly recommended to use MemoryLock=4, this is not a universal panacea; if the system is short on memory and having to page, this will prevent TimesTen memory being paged, which means that other memory will be paged more. The ultimate solution is to ensure enough memory so that there is very little paging occurring.
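A sketch of what that looks like (the DSN name and instance admin user name are placeholders, and the O/S side varies by platform, so treat the Linux lines as an example only):

In sys.odbc.ini:
[mydsn]
MemoryLock=4

On Linux, allow the instance admin user to lock enough memory, e.g. in /etc/security/limits.conf:
ttadmin  soft  memlock  unlimited
ttadmin  hard  memlock  unlimited

MemoryLock is a first-connection attribute, so the datastore has to be unloaded and reloaded for the new setting to take effect.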
A secondary, but major, concern is this:
This shows that after a while the replication agent delivering AWT changes to Oracle is unable to keep up and a large backlog of AWT changes is accumulating in TimesTen. This situation would be very bad in production. The usual cause of this is that (a) by default TimesTen uses a single connection to the Oracle DB to push AWT changes and (b) Oracle is unable to accept the changes fast enough over this single connection. In TimesTen 11.2.2 you can configure parallel AWT, which will use multiple connections to push changes to Oracle. Depending on the data and commit dependencies between transactions on the AWT tables this may give significant improvements to AWT throughput. You need to investigate this also, since the current setup cannot cope with this production load if it persists for a significant time.
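If I remember correctly, parallel AWT in 11.2.2 is enabled with a first-connection attribute along the lines of the sketch below; the value is illustrative only, and you should check the 11.2.2 documentation for the exact attribute name and any related ordering/replication settings before changing anything:

In sys.odbc.ini:
[mydsn]
CacheAwtParallelism=4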