OBIEE Performance Analytics: Analysing the Impact of Suboptimal Report Design


    OBIEE can do complex processing on data that you request from a data source that may not support it (e.g., analytical windowing functions against data in Hive, or MS Access). Where OBIEE can, it will generally push all the complex work down to the database it’s querying from. If it can't, it will pull the data back in raw form and then do the work itself. And this is where the problems can start. Oracle ACE Robin Moffatt's article will help you avoid the headaches.


     

    by Oracle ACE Robin Moffatt

     

    Oracle Business Intelligence Enterprise Edition (OBIEE) is a fantastically powerful analytical and reporting tool. One of the great things about OBIEE is its ability to run a report and get the data you've asked for, whatever it takes. It could be running a single query against a dedicated data warehouse; it could be federating the results of multiple queries across a mixture of transactional and analytical relational sources, or even Hadoop. OBIEE can do complex processing on data that you request from a data source that may not support it (e.g., analytical windowing functions against data in Hive, or MS Access). Where OBIEE can, it will generally push all the complex work down to the database it’s querying from. If it can’t, it will pull the data back in raw form and then do the work itself. And this is where the problems can start….

     

    Two of the common symptoms of the performance problem that I’m going to discuss here are:

     

    • Variance in response times: When a report is usually fast, but there are hotspots of slow performance
    • Disk space: OBIEE uses lots of disk space, sometimes running out

     

    The underlying cause of these problems can often be down to sub-optimal design at one--or several--points in the OBIEE implementation. It’s easy to understand how this came about. Reports are often built in a development environment with little or no data, and so long as they work functionally, the "non-functional" aspect of performance doesn’t come into play. It's only once reports start to be run against proper data sets, and with concurrent users, that serious issues can arise.

    In this article we're going to look at what happens to performance when the BI Server is doing more than its fair share of the work. It will give us an understanding of the stresses that a badly designed report or RPD can put on the system, and what to look for when we do encounter this kind of performance problem. What we’re going to see is the following:

     

    • Suboptimal design can cause the I/O subsystem on the application server to become overwhelmed. Because I/O throughput is frequently overlooked in favor of the more obvious CPU and memory metrics when assessing a server’s capacity, it’s easy to be given the "all clear" even when it’s actually flat on its back.
    • Low User CPU does not mean that the system is not under pressure; you must also look at CPU Wait and Idle, as well as the load average.
    • Reports with large amounts of data returned to the end user require lots of memory and temp disk space for Presentation Services and temp disk space for the BI Server
    • Reports such as these, when run concurrently, can easily exhaust the disk space available to the temp folder, resulting in aborted queries with errors such as:
      Query Status: [nQSError: 46073] Operation ‘write()’ on file
       ‘/u01/app/oracle/product/instances/instance1/tmp/OracleBIServerComponent/
       coreapplication_obis1/obis_temp/nQS_1229_557_52279963.TMP’ failed with error: (2) (.
      
      [nQSError: 46017] Sort has no work space in the work directory.
      [nQSError: 43119] Query Failed:
    • Lots of data moving through the stack is generally going to be inefficient and means increased network and disk I/O throughput.
          If you see this behavior with the execution of a single report, you need to consider whether you have the necessary bandwidth on the server when concurrent requests of a similar nature run, and also what this resource utilization means for other functions on the server (e.g. co-located ETL).

     

    To analyze the performance, we’re going to use examples run in SampleApp v406, but with the database hosted on a separate machine.

     

    What Happens When an OBIEE Query Runs?

     

    Before diving into the nitty-gritty of this, let’s take a moment to examine how an OBIEE analysis executes. Understanding this enables us to accurately describe a query’s progress through the stack and explain observations made. It also enables us to build a time profile of the overall response. A time profile is something that Cary Millsap discusses in detail in his excellent paper, Thinking Clearly About Performance, a lot of which inspired my talk, No Silver Bullets - OBIEE Performance in the Real World (you can find the slides and video online). A time profile is built based on the components of execution, which for OBIEE are broadly as follows. Note that we’re assuming a single analysis here; for multiple analyses on a dashboard much of this would run concurrently (which in itself can also trigger problems):

     

    1. User opens a dashboard/hits refresh.
    2. Request passes through WebLogic Server to Presentation Services (Oracle Business Intelligence Presentation Services: OBIPS).
    3. Presentation Services (OBIPS) generates the Logical SQL for the analysis and sends it via ODBC to the Oracle Business Intelligence Server (OBIS).
    4. BI Server (OBIS) parses the Logical SQL through the RPD and generates one or more Physical SQL statements in order to retrieve the data needed.
    5. OBIS receives all the data it needs from the data sources and does any additional processing required. This could be stitching multiple data sets together (federation), combining them (union/fragmentation), or processing a single data set if the work couldn’t be done in the database (sort/aggregate/filter/etc.).
    6. OBIS sends data back to OBIPS, where the report is generated and formatted and fed back to the client browser.

     


     

    Understanding this basic flow is key to identifying bottlenecks in performance and understanding how to address any issues observed.

     

    Pulling Large Data Volumes from Source Database

     

    We’re going to have a look at how BI Server behaves when pulling back lots of data from the source database. This can come about for the obvious reason of building a very large report, or in the less obvious case when the BI Server can’t push all the processing to the data source (maybe it doesn’t support it, or there’s federation involved) and so the BI Server pulls back the data prior to aggregation and/or filtering and does it itself.

     

    Level 01

     

    To make this a simple example, I’m using a Direct Database Request (DDR) as it’s a nice, easy way to force OBIEE to run exactly the database query that you want it to. As an aside, this is exactly why DDR gets so frequently abused and you should note that I’m using it here simply to illustrate a point, not because it’s A Good Thing. Reports built with DDR are inflexible, they’re difficult to support, difficult to scale … the RPD is provided for a reason – use it!

     

    The query we’re going to start by running is a pretty small one. It pulls back a bunch of data from one of the Oracle system catalog tables, all_objects. What the data contains is irrelevant; we just want a bunch of rows to pull back from the database to illustrate the point.

     

    select * from all_objects a

     


     

    In the results page, expand the table view to show all rows, which forces all the data to be pulled back through the stack (instead of just the first 25).

     

    This basic query ran in 7 seconds, as we can see from the nqquery.log:

     

    [2015-12-19T21:58:54.000+00:00]
    SQL Request, logical request hash:
    58a2d9dc
    SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';EXECUTE PHYSICAL CONNECTION POOL
    "Sample Relational Connection" select * from all_objects a

    [2015-12-19T21:58:54.000+00:00]
    Sending query to database named Sample Relational Connection (id: ExecutePhysical Gateway),
    connection pool named Sample Relational Connection, logical request hash 58a2d9dc,
    physical request hash 9d76179b: [[
    select * from all_objects a

    [2015-12-19T21:59:01.000+00:00]
    Query Status: Successful Completion
    Rows 74763, bytes 186010344 retrieved from database query id: ExecutePhysical Gateway [[
    Physical query response time 4 (seconds), id ExecutePhysical Gateway
    Physical Query Summary Stats: Number of physical queries 1, Cumulative time 4, DB-connect time 0 (seconds)
    Rows returned to Client 74763 [[
    Logical Query Summary Stats: Elapsed time 7, Response time 0, Compilation time 0 (seconds) [[
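
    When you're faced with a directory full of these logs, a quick shell sketch can pull out the headline figures. This is illustrative only: the sample line below is copied from the log output above, and the field positions assume the 11g log format shown here; point the awk at your real nqquery.log in practice.

    ```shell
    # Demo data: one "retrieved from database" line copied from the log above.
    cat > sample_nqquery.log <<'EOF'
    Rows 74763, bytes 186010344 retrieved from database query id: ExecutePhysical Gateway [[
    EOF

    # Report rows and megabytes fetched per physical query.
    awk '/retrieved from database/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "Rows")  rows = $(i+1) + 0        # "+0" drops the trailing comma
        if ($i == "bytes") mb   = $(i+1) / 1048576  # bytes -> MB
      }
      printf "%d rows, %.1f MB\n", rows, mb
    }' sample_nqquery.log
    ```

    For the sample line this prints `74763 rows, 177.4 MB`, which is a quicker read than eyeballing raw byte counts across dozens of log entries.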

     

    Using the Rittman Mead Performance Analytics Dashboard we can easily analyze the behavior of the OS and OBIEE during execution, all in one place.

     

    Some particular points to note:

     

    • The heap size (memory) and use of temp on disk for Presentation Services (but not BI Server) both increase:


            The memory figures in the above chart are taken from the DMS counter heapSize, and are matched in the Linux proc stats:

      Presentation Services is responsible for generating and formatting the page that’s returned to the web browser, hence the increased memory and disk usage in working with these 75,000 rows of data.

    • There’s a burst of inbound network traffic at 21:59, marking the transfer of data from the database to BI Server. Around the same time, there’s disk write activity, which is the temp files being written by OBIEE as the data is received from the database.

     

    In this first example, there was no BI Server temp written on disk at all, but there is sawserver temp (48Mb at peak). After the query has run and the temp falls back down, there is still a notable chunk in the querycache folder (which is presumably the Presentation Services query cache):

     

    $ pwd
    
    /app/oracle/biee/instances/instance1/tmp/OracleBIPresentationServicesComponent/
    coreapplication_obips1
    
    $ du -sh ./*
    0 ./D_22sJ6EmumPlysgoC8jSWn_o8p3ci+z
    4.0K ./dxe
    4.0K ./J5h+oq80au7HtXDcOrlDhghash.tmp
    4.0K ./mobile
    4.0K ./multipartdoc
    4.0K ./obis_temp
    4.0K ./preview
    896K ./qmcache
    6.7M ./querycache_0
    4.0K ./querycache_1
    4.0K ./querycache_2
    4.0K ./sawcharts
    4.0K ./sawmaps
    912K ./sawrptcache
    904K ./sawvc
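
    A simple way to watch this temp usage in real time is to poll du while the report runs. A minimal sketch, assuming the SampleApp path shown above (the `TMP_OBIPS` variable name is mine; override it for your own instance):

    ```shell
    # Sample the size of the Presentation Services temp folder once a second.
    # Path is the SampleApp default; set TMP_OBIPS to your own instance's tmp dir.
    TMP_OBIPS=${TMP_OBIPS:-/app/oracle/biee/instances/instance1/tmp/OracleBIPresentationServicesComponent/coreapplication_obips1}

    for i in 1 2 3; do
      printf '%s  %s\n' "$(date '+%H:%M:%S')" "$(du -sh "$TMP_OBIPS" 2>/dev/null | cut -f1)"
      sleep 1
    done
    ```

    Run this in a second terminal while executing the report and you can see the temp footprint grow and then fall back as the query completes (`watch -n1 du -sh <dir>` does much the same interactively).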
    

     

    Level Up

     

    Now, let’s double the number of rows pulled back by adding a Cartesian into our DDR:

     

    select *
    from all_objects a,
         (select 1
          from dual
          connect by level <= 2) ;
    

     

    From nqquery.log we can see the data has, as expected, doubled: nearly 150,000 rows and over 350Mb being pulled back from the database, through BI Server and Presentation Services:

     

    [2015-12-19T22:56:24.000+00:00] Sending query to database named Sample Relational 
    Connection (id: ExecutePhysical Gateway), connection pool named Sample Relational Connection, 
    logical request hash 5fff717a, physical request hash e4f82db3: [[
    select * from all_objects a , (select 1 from dual connect by level <= 2) ;

    [2015-12-19T22:56:33.000+00:00]
    Rows 149526, bytes 374413104 retrieved from database query id: ExecutePhysical Gateway [[
    Physical query response time 5 (seconds), id ExecutePhysical Gateway [[
    

     

     

    As before, the use of memory by Presentation Services (sawserver) goes up (by 550MB), as does its temp on disk. This time, we also see the BI Server (nqsserver) writing some temp to disk too (28Mb at max).

     

    By running the Linux tool iotop whilst the report is running, we can observe what the processes are up to in terms of disk throughput:

     

        TIME TID PRIO  USER  DISK READ  DISK WRITE  SWAPIN IO COMMAND
    
    22:56:27 14269 be/4 oracle 0.00 B/s 5.03 M/s 0.00 % 20.49 % nqsserver -quiet
    22:56:27 14368 be/4 oracle 0.00 B/s 2.91 M/s 0.00 % 13.22 % nqsserver -quiet
    22:56:27 14396 be/4 oracle 0.00 B/s 10.15 M/s 0.00 % 0.00 % sawserver
    

     

    So both BI Server and Presentation Services are writing to the disk at around 10Mb/sec. The implications of this we will see shortly.
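
    Running iotop in batch mode (`iotop -obtk`) produces lines like those above, which can then be post-processed to total the write throughput per process. A rough awk sketch, using the sample rows from this test and assuming that column layout (value/unit pairs for the read and write columns):

    ```shell
    # Demo data: iotop batch-mode rows copied from the output above.
    cat > iotop_sample.txt <<'EOF'
    22:56:27 14269 be/4 oracle 0.00 B/s 5.03 M/s 0.00 % 20.49 % nqsserver -quiet
    22:56:27 14368 be/4 oracle 0.00 B/s 2.91 M/s 0.00 % 13.22 % nqsserver -quiet
    22:56:27 14396 be/4 oracle 0.00 B/s 10.15 M/s 0.00 % 0.00 % sawserver
    EOF

    # Sum disk-write throughput per command, normalising B/s and K/s to MB/s.
    awk '{
      w = $7                          # write figure; unit follows in $8
      if ($8 == "K/s") w /= 1024
      if ($8 == "B/s") w /= 1048576
      total[$13] += w                 # $13 is the command name
    }
    END { for (p in total) printf "%s %.2f MB/s\n", p, total[p] }' iotop_sample.txt
    ```

    For the sample rows this totals 7.94 MB/s across the nqsserver threads and 10.15 MB/s for sawserver, which makes it easy to see which component is driving the disk.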

     

    End of Level Boss

     

    We’ll stop beating around the bush now, and really chuck some work at the server. It’s the same query style as before, just with an even bigger Cartesian. In total we’re going to pull back 7.4 million rows from the database, totalling 1.4 Gb of data. It's important to note here that I've had to construct a DDR specifically to generate this load. Typically, the amount of data that OBIEE pulls back from the database in a single query is limited by good RPD design, good report design, and the row limits set per database object in the RPD.

     

    Here’s the query launching:

     

    [2015-12-19T23:15:19.000+00:00] Sending query to database named Sample Relational Connection 
    (id: ExecutePhysical Gateway), connection pool named Sample Relational Connection, logical request hash 
    e4bfd857, physical request hash 6f8227e2: [[
    select * from all_objects a , (select 1 from dual connect
     by level <= 100) ;

     

    By default, you only get the first 25 rows of the table showing when you first run the report.

     

     

    Before expanding the table to pull back all the rows, note that at this point there’s just the one BI Server (nqsserver) process writing to disk, presumably as it is still receiving data from the database:

     

        TIME TID PRIO  USER  DISK READ  DISK WRITE  SWAPIN IO COMMAND
    
    23:20:38 14269 be/4 oracle 0.00 B/s 8.42 M/s 0.00 % 32.68 % nqsserver -quiet

     

    There’s also just a single BI Server TMP file on disk (growing rapidly):

     

    Sat Dec 19 23:21:31 GMT 2015
    total 3.1G
    -rw------- 1 oracle oinstall 3.0G Dec 19 23:21 nQS_14111_7_83719644.TMP

     

    Once you click the option to expand the table view and show the maximum number of rows, you'll see additional BI Server processes writing to disk, as well as Presentation Services (sawserver) appearing:

     

        TIME TID PRIO  USER  DISK READ  DISK WRITE  SWAPIN IO COMMAND
    
    23:21:53 14272 be/4 oracle 0.00 B/s 1976.32 K/s 0.00 % 12.11 % nqsserver 
    -quiet
    23:21:53 14269 be/4 oracle 3.86 K/s 1976.32 K/s 0.00 % 8.73 % nqsserver 
    -quiet
    23:21:53 14360 be/4 oracle 0.00 B/s 1111.68 K/s 0.00 % 7.86 % nqsserver 
    -quiet
    23:21:53 14396 be/4 oracle 0.00 B/s 18.96 M/s 0.00 % 0.00 % sawserver
    

     

    There’s also a second BI Server temp file that starts to get written to on disk as well:

     

    Sat Dec 19 23:22:03 GMT 2015
    total 3.2G
    -rw------- 1 oracle oinstall 3.1G Dec 19 23:21 nQS_14111_7_83719644.TMP
    -rw------- 1 oracle oinstall  15M Dec 19 23:22 nQS_14111_8_84112206.TMP
    

     

    The system metric profile now becomes rather interesting:

     

     

    Because the query takes many minutes to run, it gives us the chance to gather some "steady state" figures, which we can assume is the system running at full pelt, pulling in all the millions of rows of data. Points of particular note are as follows:

     

    • The CPU is busy on c.10% user work, but also c.10% wait, which is usually indicative of it waiting for disk or network operations to complete.

     

     

    • The network is at a constant c.200 Kb/s input, and disk c.12Mb/s write.

     

     

    The database is on the same subnet as the application server with fast physical connection, so the maximum 200Kb/s input rate suggests that the constraint would be on the other end (i.e., the database reading/sending the data).

     

    • The disk subsystem is notably busy, hitting 100% capacity at points.
    • Presentation Services (sawserver) memory use rose by 6GB.

     

    Also interesting is that a lot of the heavy activity takes place after the logical query completes at 23:22:

     

    [2015-12-19T23:22:05.000+00:00] Rows 7476300, bytes 1540786016 retrieved from 
    database query id: ExecutePhysical Gateway [[
    Physical query response time 172 (seconds), id ExecutePhysical Gateway [[
    Logical Query Summary Stats: Elapsed time 406, Response time 0, Compilation time 0 (seconds) [[

     

    Looking at the temp disk behavior, it seems the DB finished returning its rows at c.23:22, which matches the nqquery.log time of the logical query completing. However, temp and heap continue to grow for several more minutes, including disk I/O hitting 100% usage, and CPU Wait at c.25%. This is presumably Presentation Services taking the data from BI Server and preparing it to send back to the web browser.

     


     

    Game Over

     

    Above, we saw how one “heavy” query behaves on the OBIEE application server. In real life, there’ll be multiple concurrent users on OBIEE, along with potentially other workloads (e.g., TimesTen on Exalytics), meaning that this level of resource consumption can have serious implications. If we now run two of these queries (modified slightly so that they don’t get a shared cursor in Presentation Services) as an example of concurrent usage on OBIEE, we completely flatten the server: the 1-minute load average is 76!

     

    Note that User CPU is extremely low, but so is Idle CPU, because it’s all on Wait.

     


     

    What’s causing the CPU Wait? IO!

     


     

    So even if User CPU, commonly looked at when doing a cursory check to see if a server’s “slow,” is not at 100%, it doesn’t mean that the server is not starved of resource.

     

    How Do I Know If My OBIEE System Has This Problem?

     

    It’s possible to get a rough idea which queries might warrant a bit of investigation by analyzing the data in the Usage Tracking table S_NQ_ACCT. For example:

     

    SELECT START_TS,SAW_SRC_PATH, CUM_NUM_DB_ROW, TOTAL_TIME_SEC
    FROM S_NQ_ACCT
    WHERE CUM_NUM_DB_ROW > 1000000;
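
    A slightly richer version of this query can also surface how many of those database rows actually reached the user, which is a good smell test for work the BI Server is doing itself. The extra column names here (ROW_COUNT, USER_NAME) are from the standard 11g S_NQ_ACCT definition; verify them against your own usage tracking schema before relying on this:

    ```sql
    -- Queries fetching far more rows from the database than they return
    -- to the client are candidates for pushing work down to the database.
    SELECT start_ts,
           user_name,
           saw_src_path,
           row_count,                                        -- rows returned to the client
           cum_num_db_row,                                   -- rows fetched from the database
           ROUND(cum_num_db_row / NULLIF(row_count, 0)) AS fetch_ratio,
           total_time_sec
    FROM   s_nq_acct
    WHERE  cum_num_db_row > 1000000
    ORDER  BY cum_num_db_row DESC;
    ```

    A high fetch_ratio (millions fetched, a few hundred returned) usually points at aggregation or filtering happening on the BI Server rather than in the database.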

     

    And if you don’t have Usage Tracking enabled? Go and enable it now. Seriously. It’s invaluable.

     

    Can We Fix It? Yes, We Can!

     

    Fixing this kind of performance problem requires careful and thorough examination of the report(s) in question and an understanding of what is causing the excess movement of data. From this, it should become clear what needs doing to optimize it and make it more efficient. The causes will generally fall into one of two categories:

     

    • The report shows only a fraction of the rows being pulled back from the database, but because of the RPD design and/or the report, the BI Server can’t push the required summarization down to the database and instead has to pull back a greater amount of data to then process on the BI Server and/or Presentation Services.
    • The user is deliberately trying to return large volumes of data (often to then just export to Excel).

     

    Quite often, once the evidence of the cause for poor performance is put in front of developers, they’ll rapidly find a way to fix it. They may have had a sneaking suspicion all along that what they’d built wasn’t as performant as it could be, or maybe realized that now’s the time to do that RPD modelling optimization that they’d been putting off for a future date.

     

    KIWI?

     

    You may wonder, given that I’ve clearly demonstrated a system bottlenecked on I/O to the local disk, why we don’t just get a bigger pipe for the disk. That is, a faster disk controller and/or faster disks (maybe SSDs), or maybe even RAM Disk, which I discuss below. The thing is, this misses the point. Known as K.I.W.I. (Kill It With Iron), throwing more server hardware at capacity issues is usually not a long-term solution if you want to build a system that’s scalable and economical. If you have inefficient queries running, running the same inefficient queries on bigger hardware leaves you with the same problem: inefficiency. As time goes by and people continue to build even bigger inefficient queries, or start to run them concurrently, the problems will come straight back. But, if you fix these problems at source, you regain system stability and you increase your available capacity without spending a dime!

     

    Restricting Database Executions by Row Count or Execution Time

     

    Using the BI Server query governor, it is possible to restrict the number of rows and/or execution time on the database. This is a good "fail-safe" to have in place if you're concerned that reports may be written that pull excessive data back to the point where it is going to impact the performance of the BI Server or source database.

     

    When the query governor kicks in, you'll see your OBIEE query abort with a message such as:

     

    [nQSError: 43113] Message returned from OBIS.
    
    [nQSError: 43121] Max Row Limit Exceeded
    
    [nQSError: 43119] Query Failed:
    
    [nQSError: 43093] An error occurred while processing the EXECUTE PHYSICAL statement.
    
    [nQSError: 60008] The query for user 'weblogic' exceeded the maximum query governing 
    rows 42 from the database 'Sample Relational Connection'.
    

     

    The query governor can be configured per application role (or user, but this is not such a good idea from a maintainability point of view) in the RPD.

     

    As of 12c (12.2.1), the query governor is also configurable as a global default in the NQSConfig.ini file under DEFAULT_DB_MAX_ROWS and DEFAULT_DB_MAX_EXEC_TIME. You also have the option of setting it per session using the system session variables OBIS_DB_MAXROWS and OBIS_DB_MAXEXECTIME, which are present in 11.1.1.7 and, in 12c, can also be set as request variables.
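
    As a sketch, the 12c global defaults would look something like this in NQSConfig.ini; the parameter names are as above, but the section placement and the values here are illustrative only, not recommendations:

    ```ini
    # NQSConfig.ini (12c) -- illustrative values only; size these for your
    # own workload, and prefer per-role governor settings in the RPD where
    # finer-grained control is needed.
    [ SERVER ]
    DEFAULT_DB_MAX_ROWS      = 1000000;   # max rows per physical query
    DEFAULT_DB_MAX_EXEC_TIME = 600;       # max execution time, seconds
    ```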

     

    Note that the query governor limits apply per physical query executed--so one logical query could still return cumulatively more rows from the database than the governor limit, if that logical query spawned multiple physical queries that each returned fewer rows than the limit.

     

    Footnote: Moving Work Files to Memory with RAMDisk

     

    If excess data must pass through OBIEE, we want to help it happen as efficiently as possible. One avenue of investigation is to use faster storage for the temporary files, such as SSD or even RAM Disk so that they’re stored directly in memory. This is discussed in detail in Oracle support document 1569118.1.

     

    This is easy to do on Linux, courtesy of /dev/shm. You can configure WORK_DIRECTORY_PATHS in the BI Server NQSConfig.ini, or just create a symlink for the whole OBIEE temp folder, at $FMW_HOME/instances/instance1/tmp (assuming default instance name). After making this change, restart the stack.
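
    Before making the move, it's worth checking how much headroom /dev/shm actually has: as a tmpfs it defaults to half of physical RAM on Linux, and as we've seen, the temp files can run to gigabytes. A quick sanity check (falling back to /tmp on systems without /dev/shm):

    ```shell
    # Show capacity and free space on the RAM-backed filesystem before
    # pointing OBIEE temp at it; /dev/shm defaults to half of RAM.
    df -k /dev/shm 2>/dev/null || df -k /tmp
    ```

    If the available space is smaller than your worst-case temp footprint, either remount /dev/shm with a larger `size=` option or stick with fast physical disk for temp.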

     

    I compared the before and after figures of making this change. First I ran a baseline test, the 1.5M row request from the DB:

     

    [2015-12-21T09:52:04.000+00:00]
    Sending query to database named Sample Relational Connection (id: ExecutePhysical Gateway),
    connection pool named Sample Relational Connection
    select * from all_objects a , (select 1 from dual connect by level <= 20) ;

    [2015-12-21T09:53:18.000+00:00]
    Query Status: Successful Completion [[
    Rows 1495260, bytes 3744131040 retrieved from database query id: ExecutePhysical Gateway [[
    Physical query response time 38 (seconds), id ExecutePhysical Gateway [[
    Physical Query Summary Stats: Number of physical queries 1, Cumulative time 38, 
    DB-connect time 0 (seconds) [[
    Rows returned to Client 1495260 [[
    Logical Query Summary Stats: Elapsed time 74, Response time 35, Compilation time 0 (seconds) [[

     

    The OS metrics show, as expected, a machine busy with disk writes and inbound network traffic.

     

    Now we move the BI Server temp folder to RAMDisk (/dev/shm):

     

    oracle@demo:tmp () $ ll
    total 32
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 Essbase
    drwxr----- 2 oracle oinstall 4096 Jan 27  2014 FinancialReporting
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIClusterControllerComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIJavaHostComponent
    drwxr-x--- 4 oracle oinstall 4096 Mar 28  2014 OracleBIODBCComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIPresentationServicesComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBISchedulerComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIServerComponent
    oracle@demo:tmp () $ mv OracleBIServerComponent/ /dev/shm/
    oracle@demo:tmp () $ ln -s /dev/shm/OracleBIServerComponent/ OracleBIServerComponent
    oracle@demo:tmp () $ ll
    total 28
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 Essbase
    drwxr----- 2 oracle oinstall 4096 Jan 27  2014 FinancialReporting
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIClusterControllerComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIJavaHostComponent
    drwxr-x--- 4 oracle oinstall 4096 Mar 28  2014 OracleBIODBCComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIPresentationServicesComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBISchedulerComponent
    lrwxrwxrwx 1 oracle oinstall   33 Dec 21 10:05 OracleBIServerComponent -> /dev/shm/OracleBIServerComponent/
    

     

    We run the same test as before:

     

    [2015-12-21T10:11:28.000+00:00]
    Sending query to database named Sample Relational Connection (id: ExecutePhysical Gateway),
    connection pool named Sample Relational Connection, logical request hash 5312c84d, physical 
    request hash 6b3bd34e: [[
    select * from all_objects a , (select 1 from dual connect by level <= 20) ;

    Query Status: Successful Completion [[
    Rows 1495260, bytes 3744131040 retrieved from database query id: ExecutePhysical Gateway [[
    Physical query response time 37 (seconds), id ExecutePhysical Gateway [[
    Physical Query Summary Stats: Number of physical queries 1, Cumulative time 37, DB-connect time 0 (seconds) [[
    Rows returned to Client 1495260 [[
    Logical Query Summary Stats: Elapsed time 43, Response time 8, Compilation time 0 (seconds) [[

     

    Interestingly, the network throughput maximum goes up from c.300Kb/s to c.500Kb/s, which could be because the BI Server is able to write its temp files faster, and thus consume it from the source database over the network more quickly.

     

    There’s still a heavy hit of disk activity, because Presentation Services (sawserver) is still using disk for its temp (at this point we’ve only moved BI Server temp to RAM Disk). This is confirmed by iotop, which shows sawserver writing to disk--at a greater rate than before. This is because BI Server is able to feed it the data faster, and/or there’s less contention on writes to disk now that BI Server is using memory instead:

     

        TIME TID PRIO  USER  DISK READ  DISK WRITE  SWAPIN IO COMMAND
    
    10:21:24 16348 be/4 oracle 0.00 B/s 24.27 M/s 0.00 % 0.00 % sawserver

     

    BI Server (nqsserver) writes a temp file to /dev/shm:

     

    Mon Dec 21 10:21:46 GMT 2015
    total 110M
    -rw------- 1 oracle oinstall 110M Dec 21 10:21 nQS_16039_2_37267954.TMP

     

    Of note is that the max size seen for the NQS temp file is 110MB, yet the data returned from the DB is 3.4Gb; either NQS is compressing it, or it’s not storing the whole dataset before it then passes it on to Presentation Services (e.g., the temp file is just a transient point for the data, not a resting place):

     

     

    By using RAMDisk, the runtime is reduced from c.75 seconds to c.45 seconds, because the BI Server is in effect able to use memory instead of disk as a landing point for data returned from the database.

     

    What about if we move Presentation Services temp onto RAMDisk, too?

     

    oracle@demo:tmp () $ ll
    total 28
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 Essbase
    drwxr----- 2 oracle oinstall 4096 Jan 27  2014 FinancialReporting
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIClusterControllerComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIJavaHostComponent
    drwxr-x--- 4 oracle oinstall 4096 Mar 28  2014 OracleBIODBCComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIPresentationServicesComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBISchedulerComponent
    lrwxrwxrwx 1 oracle oinstall   33 Dec 21 10:05 OracleBIServerComponent -> /dev/shm/OracleBIServerComponent/
    oracle@demo:tmp () $ mv OracleBIPresentationServicesComponent/ /dev/shm
    oracle@demo:tmp () $ ln -s /dev/shm/OracleBIPresentationServicesComponent/ OracleBIPresentationServicesComponent
    oracle@demo:tmp () $ ll
    total 24
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 Essbase
    drwxr----- 2 oracle oinstall 4096 Jan 27  2014 FinancialReporting
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIClusterControllerComponent
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBIJavaHostComponent
    drwxr-x--- 4 oracle oinstall 4096 Mar 28  2014 OracleBIODBCComponent
    lrwxrwxrwx 1 oracle oinstall   47 Dec 21 10:35 OracleBIPresentationServicesComponent -> /dev/shm/OracleBIPresentationServicesComponent/
    drwxr-x--- 3 oracle oinstall 4096 Dec  3  2013 OracleBISchedulerComponent
    lrwxrwxrwx 1 oracle oinstall   33 Dec 21 10:05 OracleBIServerComponent -> /dev/shm/OracleBIServerComponent/

     

    Restart the stack and then run the same report:

     

    [2015-12-21T10:37:21.000+00:00]
    Sending query to database named Sample Relational Connection (id: ExecutePhysical Gateway), 
    connection pool named Sample Relational Connection, logical request hash 5312c84d,
    physical request hash 6b3bd34e: [[
    select * from all_objects a , (select 1 from dual connect by level <= 20) ;

    [2015-12-21T10:38:02.000+00:00]
    Query Status: Successful Completion [[
    Rows 1495260, bytes 3744131040 retrieved from database query id: ExecutePhysical Gateway [[
    Physical query response time 37 (seconds), id ExecutePhysical Gateway [[
    Physical Query Summary Stats: Number of physical queries 1, Cumulative time 37, DB-connect time 0 (seconds) [[
    Rows returned to Client 1495260 [[
    Logical Query Summary Stats: Elapsed time 41, Response time 8, Compilation time 0 (seconds) [[

     

    Notice that the Disk Busy is almost non-existent, and the disk I/O is an order of magnitude smaller (Kb, not Mb):

     

    sawserver still does some recorded I/O but at a much lower rate:

     

        TIME TID PRIO  USER  DISK READ  DISK WRITE  SWAPIN IO COMMAND
    
    10:37:46 259 be/3 root 0.00 B/s 4.94 M/s 0.00 % 0.23 % [jbd2/xvda3-8]
    10:37:46 8677 be/4 oracle 0.00 B/s 1277.92 K/s 0.00 % 0.00 % sawserver

     

    A last point to note on the use of a RAMDisk is that while the alternative is "disk," there will usually be a level of memory involved regardless, by means of the operating system's buffer cache. This is present on both Linux and Windows, as well as other OSs.

     

    Summary

     

    In conclusion:

     

    • Suboptimal design at one or several points in the OBIEE implementation can cause the I/O subsystem on the application server to become overwhelmed. Because I/O throughput is frequently overlooked in assessments of a server’s capacity, it’s easy to be given the "all clear" even when it's unwarranted.
    • Low User CPU does not mean that the system is not under pressure; you must also look at CPU Wait and Idle, as well as the load average.
    • Large reports returned to the end user require lots of memory and disk space for Presentation Services.
    • Lots of data returned from the database requires temp disk space for the BI Server.
    • “Heavy” reports such as these, when run concurrently, can easily exhaust the disk space available to the temp folder, resulting in aborted queries with errors such as :
      Query Status: [nQSError: 46073] Operation ‘write()’ on file ‘/u01/app/oracle/product/
      instances/instance1/tmp/OracleBIServerComponent/coreapplication_obis1/obis_temp/
      nQS_1229_557_52279963.TMP’ failed with error: (2) (.
      [nQSError: 46017] Sort has no work space in the work directory.
      [nQSError: 43119] Query Failed:
    • Lots of data moving through the stack means increased network and disk I/O throughput. If you see this behavior with the execution of a single report, you must consider whether you have the necessary bandwidth on the server when concurrent requests of a similar nature run, and also what this resource utilization means for other functions on the server (e.g., co-located ETL).

     

    If you have reports that are behaving like this, the solution is to get them to move less data around. This could mean returning fewer rows to the end user, through better use of report predicates, and through discussing with the user how the data is actually consumed, to understand whether the same need could be met in a more optimal way. Alternatively, the solution is to present the same data but to do so more efficiently. Efficiency improvements are usually to be found in the design of the report (for example, look for complex calculated columns that would be better pushed to the RPD or database), as well as more generally in the RPD and database, by storing and modelling the data in a way that better supports the types of queries being run. A final suggestion: where appropriate, schedule bigger reports to run out of hours through BI Delivers, thus reducing the risk of them impacting online users during the day.

     

    Further Reading and References

     

    The documentation for OBIEE includes very useful sections on OBIEE performance.

     

     

    Oracle has published white papers, available through My Oracle Support, that detail further options available within the OBIEE stack for tuning performance.

     

     

    The following article on the Rittman Mead blog describes further optimization possibilities for OBIEE, including use of a Web Tier: Optimising OBIEE performance. An associated presentation is available: No Silver Bullets - OBIEE Performance in the Real World

     

    Last, but very definitely not least, a most important paper for understanding performance optimization work in general is Cary Millsap's Thinking Clearly About Performance.

     

    About the Author

     

    Oracle ACE Robin Moffatt is the head of R&D (Europe) at Rittman Mead.

     


     

    This article has been reviewed by the relevant Oracle product team and found to be in compliance with standards and practices for the use of Oracle products.