Imagine you encounter an issue with your database and need to collect the relevant trace and diagnostic files, perhaps even across several nodes. Sound familiar? If so, you probably also know how time-consuming and complex this task can be.

 

Well, did you know that there is a tool that can do this task for you? It is called Trace File Analyzer Collector, or TFA for short, and it is applicable to both RAC and non-RAC databases.

 

As the systems supporting today's complex IT environments become larger, more complex and more numerous, it can be difficult and time-consuming for those managing them to know which diagnostics to collect. In addition, systems might be distributed among different data centers. Hence, when an issue occurs, it may take several attempts and iterations before all relevant diagnostic files are uploaded to Oracle Support for troubleshooting.

 

Before we go into the details of TFA, let's have a look at a traditional diagnostic lifecycle:

[Figure: Traditional diagnostic lifecycle]

 

Trace File Analyzer Collector (TFA) addresses exactly this problem. It provides a simple, efficient and thorough mechanism for collecting ALL relevant first-failure diagnostic data in one pass, as well as useful interactive diagnostic analytics. TFA is started once and collects data from all nodes in the TFA configuration. As long as the approximate time at which a problem occurred is known, one simple command can be run that collects all the relevant files from the entire configuration based on that time. The collection includes Oracle Clusterware, Oracle Automatic Storage Management and Oracle Database diagnostic data as well as patch inventory listings and operating system metrics.

[Figure: TFA Collector process]

Business Drivers for Implementing TFA

 

Four key business drivers should be considered in a decision to implement Trace File Analyzer Collector.

 

  • Reduced Costs
    Any problem that leads or may lead to a service interruption costs money and ties up resources. The faster a failed service can be restored, the sooner normal operation can resume and IT staff can get back to more productive activities. Spending time collecting and uploading diagnostic data over and over again, whether due to a lack of knowledge about what data is needed or because of time pressure, lengthens the resolution cycle. TFA can help compress this aspect of the cycle.
  • Reduced Complexity
    Collecting the right data is crucial to efficient analysis of diagnostic data. In moments of distress, collecting the relevant data is often replaced by collecting everything, which also includes unnecessary diagnostic data. Analyzing unnecessary data means extra cycles to determine what was provided. TFA helps eliminate those extra cycles by providing a standardized way of collecting the relevant diagnostic data across all Oracle environments, supporting nearly all configurations. TFA also comes bundled with the Support Tools Bundle, which consists of tools often requested by Oracle Support for collecting additional diagnostics. A convenient command line interface is provided for invoking those tools. This approach eliminates the need for users to download and install them separately.
  • Increased Quality of Service
    TFA’s pruning algorithms, applied during the collection and upload phase, allow Oracle Support to respond much more quickly and provide more valuable support. TFA’s one-file-per-host transfer accelerates root cause analysis by avoiding multiple rounds of communication between Oracle Support and the customer.
  • Improved Agility
    Using Trace File Analyzer Collector means not having to worry about what data to collect. Once set up and configured, it can be run either on an automated, event-driven basis or on demand, leaving IT personnel more time to focus on productive tasks. TFA enables first-level production support personnel to upload diagnostic data just as efficiently and precisely as senior support staff.

 

TFA Architecture and Configuration Basics

 

TFA is implemented by way of a Java Virtual Machine (JVM) that is installed on, and runs as a daemon on, each host in the TFA configuration. Each TFA JVM communicates with its peer TFA JVMs in the configuration through secure sockets. A Berkeley DB (BDB) on each host is used to store TFA metadata about the configuration and about the directories and files that TFA monitors on each host.

 

Supported versions (as of Sept. 2015) of Oracle Clusterware, ASM and Database are 10.2, 11.1, 11.2 and 12.1. TFA is shipped and installed with any 11.2.0.4 and 12.1.0.2 Grid Infrastructure home. For any other version, TFA must be downloaded and installed separately, following the instructions in My Oracle Support Note 1513912.1.

 

Supported platforms (as of Sept. 2015) are Linux, Solaris, AIX and HP-UX.

 

Supported Engineered Systems (as of Sept. 2015) are Exadata, Zero Data Loss Recovery Appliance (ZDLRA) and the Oracle Database Appliance (ODA). Support on engineered systems includes database and storage servers.

 

Use My Oracle Support (MOS) Document 1513912.1 to download TFA and to monitor for updates on version and platform support.

 

For supported versions other than 11.2.0.4 and 12.1.0.2, Oracle recommends that TFA be downloaded from the My Oracle Support document and installed as part of a standard build.  Oracle also recommends that the version distributed with 11.2.0.4 and 12.1.0.2 be upgraded to the latest version downloadable from My Oracle Support.

 

Please note that TFA is also useful for non-RAC, single-instance database servers; it can be downloaded from the My Oracle Support document mentioned above and installed manually.

 

A simple command line interface (CLI) called tfactl is used to execute the commands that TFA supports. The JVMs only accept commands from tfactl. tfactl is used for monitoring the status of the TFA configuration, for making changes to the configuration and, ultimately, for taking collections.
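
For illustration, a few typical tfactl commands are shown below. This is only a sketch; the exact set of options available depends on the TFA version installed, so check tfactl -h on your system.

$ tfactl print status        (status of the TFA daemon on each host in the configuration)
$ tfactl print config        (current TFA configuration settings)
$ tfactl print hosts         (hosts that are part of the TFA configuration)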

 

TFA is a multi-threaded application and its resource footprint is very small.  Usually the user would not even be aware that it is running.  The only time TFA will consume any noticeable CPU is when doing an inventory or a collection and then only for brief periods.

 

Collections can be initiated from any host in the configuration. Here is an example that performs a cluster-wide collection covering the last 4 hours for all supported components:

$ tfactl diagcollect

 

The initiating host’s JVM communicates the collection requirement to its peer JVMs if files are needed from other hosts in the configuration. All collections across the configuration run concurrently and are copied back to the TFA repository on the initiating node. When all collections are complete, the user can upload them from a single host to an Oracle Service Request.

 

The TFA repository can be configured on a shared filesystem or locally. If a shared filesystem is used, the result of each host’s collection is stored in a specific subdirectory as soon as the collection is complete. If the TFA repository is instead configured locally on each host, each collection result is copied to the initiating host’s repository and then deleted from the remote host. In either case it is convenient for the user to obtain the files needed for a Service Request from a single location, which is listed at the end of the output for each collection.
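
As an illustrative sketch (the directory path and size below are placeholder values, and option names may vary slightly between TFA versions), the repository location and size limit can be inspected and adjusted with commands along these lines:

# tfactl print repository
# tfactl set repositorydir=/u01/app/tfa/repository
# tfactl set reposizeMB=10240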

 

Managing TFA is simple. For example, adding nodes to or removing nodes from the TFA configuration after the initial install is straightforward and is only required when the configuration, such as the cluster topology, changes. New databases added to a configuration, on the other hand, are discovered automatically.
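
For example, hosts can be added to or removed from an existing TFA configuration with commands of the following form (the hostname is a placeholder; verify the exact syntax with tfactl -h for your version):

# tfactl host add node3
# tfactl host remove node3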

 

Under normal operating conditions TFA spawns a thread to monitor the end of each alert log in the configuration – Database, ASM and Clusterware. TFA monitors for certain events, such as node or instance evictions, and certain errors, such as ORA-00600 and ORA-07445. TFA stores metadata about these events in the BDB for future reference.

 

TFA runs an auto-discovery and a file inventory periodically to stay up to date on databases and diagnostic files. A file inventory is kept for every directory registered in the TFA configuration, and metadata about those files is maintained in the BDB. Examples of the metadata kept are file name, first timestamp, last timestamp and file type. Differing timestamp formats are also “normalized”, as the format of timestamps can vary from one file type to the next.
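
The directories TFA currently monitors can be listed, and an additional directory registered, roughly as follows (the directory shown is a placeholder; options may differ by version):

# tfactl print directories
# tfactl directory add /u01/app/oracle/diag/custom/trace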

 

Functionality Matrix

 

Version | System Type | Comments
10.2.0.4, 10.2.0.5, 11.1.0.7, 11.2.0.1, 11.2.0.2, 11.2.0.3, 12.1.0.1 | RAC | Download latest version from Document 1513912.1
11.2.0.4 | RAC | Installed by default in $GI_HOME/tfa, $GI_HOME/tfa/bin/tfactl (tfactl -h for options)
12.1.0.2 | RAC | Installed by default in $GI_HOME/tfa, $GI_HOME/bin/tfactl (tfactl -h for options)
11.2.0.4 JUL2014 PSU upwards | RAC | Installed by default in $GI_HOME/tfa, $GI_HOME/bin/tfactl (tfactl -h for options); analytic functionality added (tfactl -analyze -h for options)
11.2.0.1, 11.2.0.2, 11.2.0.3, 12.1.0.1 | Exadata | Download latest version from Document 1513912.1
11.2.0.4 | Exadata | Database Server collection support only
11.2.0.4.6 Bundle Patch (APR2014 PSU) | Exadata | Storage Server collection support added
11.2.0.4.9 Bundle Patch (JUL2014 PSU) | Exadata | Analytic functionality added (tfactl -analyze -h for options)
12.1.0.2 | Exadata | Installed by default in $GI_HOME/tfa, $GI_HOME/bin/tfactl (tfactl -h for options); Database and Storage Server collections and analytic functionality

 

Support Tools Bundle

 

As of version 12.1.2.3.0, TFA includes tools commonly requested or recommended by Oracle Support. The tfactl command interface is used to manage these tools. Users can start, stop, check the status of and deploy the tools in a standard way using tfactl commands, without needing to download and install them all separately as in the past. Using the bundled tools from within TFA provides a standard command interface, standard deployment locations and collection of the tools' output by TFA.
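
As a sketch (tool availability depends on the TFA version and platform), the bundled tools can be listed and invoked through tfactl, for example:

$ tfactl toolstatus
$ tfactl run orachk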

 

NOTE: Oracle recommends that TFA Collector be kept at the latest version, using downloads from My Oracle Support Document 1513912.1, which is updated frequently to make new features and other improvements available.

 

TFA Collections

 

TFA only collects files that are relevant to the time of the problem. All the user has to know is the approximate time of the problem, which can then be provided using a time modifier in the CLI. TFA will then gather all the first-failure diagnostic files Oracle Support needs in order to triage the problem. The simplest and most thorough form of collection is for the default time period, which is the last 4 hours:

Example #1: $ tfactl diagcollect

 

Alternatively, a more precise time can be given, for example for a problem that occurred within the last hour:

Example #2: $ tfactl diagcollect -since 1h


The command shown in Example #1 above initiates a TFA run collecting all the relevant diagnostic data for an issue that occurred within the last four hours. Because no further restriction, such as a specific node or component, is specified, this command performs a configuration-wide collection of all OS, Clusterware, ASM, Database, Cluster Health Monitor and OS Watcher files that are relevant to that time. In addition, TFA collects other relevant files and configuration data such as patch inventories. Similarly, the command shown in Example #2 collects all the files modified within the last hour, most likely resulting in a smaller and more targeted collection.

 

tfactl provides a series of command line options to limit the collection to a subset of hosts or components. However, if the root cause is completely unknown, it might be best to simply collect as much data from the relevant time period as possible, so that nothing is missed. Reducing the amount of data to collect is not as effective as being precise about the time of the incident: the more precisely the time of the problem can be specified, the smaller the collection will be. Likewise, the more specifically a problem can be narrowed down, the more targeted a collection can be in terms of nodes and components.
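
For illustration, a collection can be narrowed down roughly as follows (the database and node names are placeholders, and flag names may vary slightly between TFA versions):

$ tfactl diagcollect -database mydb -since 2h
$ tfactl diagcollect -node node1,node2 -crs -os -since 1h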

 

TFA's inherent pruning of larger files, skipping data in traces and logs that falls outside of the specified time, is another way to make data collection as well as analysis more efficient. Again, the more precisely the time can be specified, the more the files can be pruned. The design goal of TFA is to make collections as complete, as relevant and as small as possible, saving time in collecting, copying, uploading and transferring files for both the customer and Oracle Support. It follows a simple yet very efficient principle: a file that was not modified around the time of the incident is unlikely to be relevant.

 

When a diagnostic collection is taken, the first step, performed automatically, is a "pre-collection" inventory, in case any files have been modified or created since the last periodic inventory. In a pre-collection inventory only files for the specified databases and/or components are inventoried.
 
TFA can be configured to take collections automatically when registered events occur and store them in the repository. Examples of registered events are node evictions and ORA-00600 errors, among others documented in the TFA User Guide. A built-in flood control mechanism for the auto-collection feature prevents a rash of duplicate errors or events from triggering multiple duplicate collections. Should the repository reach its configurable maximum size, TFA will suspend taking collections until space is cleared in the repository using the purge command.

Example #3: # tfactl purge -older 30d

In Example #3, any collections older than 30 days are deleted, thereby freeing space for new collections. Note that the purge command must be run as root or under sudo control.
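
The automatic, event-driven collections mentioned above can typically be enabled or disabled with a setting along the following lines (a sketch; the exact parameter name may differ between TFA versions):

# tfactl set autodiagcollect=ON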

 

TFA Analytics

 

TFA has built-in analytics that enable the user to summarize the results from Database, ASM and Clusterware alert logs, system message files and OS Watcher performance metrics across the configuration. Errors, warnings and search patterns can be summarized for any given time period for the entire configuration without needing to access the files on each node. In a similar way, OS Watcher data can be summarized for the entire configuration for any given time period.
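
As an illustrative sketch (the search string and component are examples, and the available options depend on the TFA version), the analytics can be invoked like this:

$ tfactl analyze -since 4h
$ tfactl analyze -search "ORA-00600" -since 1d
$ tfactl analyze -comp osw -since 6h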

 

Further Information

 

For further information, please refer to My Oracle Support Document 1513912.1 and the TFA User Guide.

 

For any help and support with TFA, please come and chat with us via the RAC Community.