OBIS - Varying logical execution times on loaded cluster
Having some issues with an OAS deployment where we are seeing significant variation and growth of logical query execution time over the runtime of a given OBIS node.
Generally, what I observe from usage tracking is a trend of growing logical execution time for all analyses that have extensive logical computation requirements. The more users that are connected and using the system, and the longer the uptime of the OBIS, the longer the logical executions seem to take.
To get a more focused look at the issue, I grabbed an example logical SQL that was seeing significant degradation in performance and saved it off, prepending the following to remove as many X factors as possible and limit the test to only the logical execution time in the nqsserver:
"SET VARIABLE DISABLE_PLAN_CACHE_HIT=1,SKIP_PHYSICAL_QUERY_EXEC=1, LOGLEVEL=5, DISABLE_CACHE_HIT=1,DISABLE_CACHE_SEED=1,DISABLE_PLAN_CACHE_SEED=1:SELECT
.... "
I then throw this at the server via:
nqcmd.sh -q -H -T -d AnalyticsWeb -s Proj_performance_REFRESH_CACHE_SKIP.logicalsql -impersonate EXAMPLEUSER
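To watch how this drifts over a node's uptime, I run the same call from a small wrapper loop and keep only the timing summary lines. This is just a sketch; the output file name and the 10-minute interval are my own choices:

#!/bin/bash
# Re-run the same logical-only test on an interval and keep the timing summary,
# so the drift in execute time can be lined up against OBIS uptime.
SQLFILE=Proj_performance_REFRESH_CACHE_SKIP.logicalsql
OUTLOG=logical_exec_trend.log
while true; do
  echo "=== $(date '+%Y-%m-%d %H:%M:%S') ===" >> "$OUTLOG"
  nqcmd.sh -q -H -T -d AnalyticsWeb -s "$SQLFILE" -impersonate EXAMPLEUSER 2>&1 \
    | grep -E 'total:|Cumulative time' >> "$OUTLOG"
  sleep 600
done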
The expected result is about 8-9 seconds. For example, the summary of the output would look like this:
-------------------------------
Clock time: batch start: 13:00:29.000 Query from: 13:00:29.000 to: 13:00:38.000 Row count: 0
total: 8.914 prepare: 0.557 execute: 8.329 fetch: 0.028
Cumulative time(seconds): Batch elapsed: 9 Query total: 8.914 prepare: 0.557, execute: 8.329, fetch: 0.028, query count: 1, cumulative rows: 0
Now in our problematic cluster, after a given OBIS has been online for a long time, or during times when it is under heavy load, we see huge swings in how long the logical-only portion takes to complete.
-------------------------------
Clock time: batch start: 16:23:17.000 Query from: 16:23:18.000 to: 16:23:43.000 Row count: 0
total: 25.096 prepare: 0.798 execute: 24.284 fetch: 0.014
Cumulative time(seconds): Batch elapsed: 25 Query total: 25.096 prepare: 0.798, execute: 24.284, fetch: 0.014, query count: 1, cumulative rows: 0
If I restart the OBIS, the same query then runs in the expected time once again.
Injecting those variables at the start of the SQL rules out the following as factors:
-Execution Plan Cache
-Query Cache
-Cache Seeding
To get down to a finer level of detail, I looked at the level 5 trace and compared the good and bad runs. In both cases we see there were 23 physical queries, 22 joins, 7 merge joins, etc.
Bad run:
[2022-05-12T14:07:34.869-06:00] [OBIS] [TRACE:5] [] [] [ecid: 005rhORpD2h1zWX_Ptg8yd000Czq0007JR,0] [sik: bootstrap] [tid: 72f2700] [messageid: USER-29] [requestid: fd590002] [sessionid: fd590000] [username: REDACTED] -------------------- Physical Query Summary Stats: Number of physical queries 23, Cumulative time 0.000, DB-connect time 0.003 (seconds)
[2022-05-12T14:07:34.869-06:00] [OBIS] [TRACE:5] [] [] [ecid: 005rhORpD2h1zWX_Ptg8yd000Czq0007JR,0] [sik: bootstrap] [tid: 72f2700] [messageid: USER-70] [requestid: fd590002] [sessionid: fd590000] [username: REDACTED] -------------------- Number of joins executed internally for logical request hash f7a224b6 : 22, number of sort merge join 7, number of hash join 0, number of pnlj join 0, number of eager join 15, number of index join 0, number of full outer join 15, number of left outer join 0, number of inner join 0, number of cross join 7
Good run:
[2022-05-12T14:07:27.467-06:00] [OBIS] [TRACE:5] [] [] [ecid: 005rhOROyM93FCX_PtH7id000PrD004E_s,0] [sik: bootstrap] [tid: 2fdab700] [messageid: USER-29] [requestid: 8e680002] [sessionid: 8e680000] [username: REDACTED] -------------------- Physical Query Summary Stats: Number of physical queries 23, Cumulative time 0.000, DB-connect time 0.003 (seconds)
[2022-05-12T14:07:27.467-06:00] [OBIS] [TRACE:5] [] [] [ecid: 005rhOROyM93FCX_PtH7id000PrD004E_s,0] [sik: bootstrap] [tid: 2fdab700] [messageid: USER-70] [requestid: 8e680002] [sessionid: 8e680000] [username: REDACTED] -------------------- Number of joins executed internally for logical request hash f7a224b6 : 22, number of sort merge join 7, number of hash join 0, number of pnlj join 0, number of eager join 15, number of index join 0, number of full outer join 15, number of left outer join 0, number of inner join 0, number of cross join 7
Based on that, my observation is that what the query engine is doing does not change between the good runs and the bad runs.
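To pull those summary lines out of the query log for one good and one bad request and compare them side by side, I use something like the following. The log path is an assumption based on a typical OAS domain layout; substitute your own DOMAIN_HOME and instance name:

# Extract the USER-29 / USER-70 summary lines for one bad and one good request id and diff them.
LOG=$DOMAIN_HOME/servers/obis1/logs/nqquery.log
grep 'requestid: fd590002' "$LOG" | grep -E 'USER-(29|70)' > bad_run_summary.txt
grep 'requestid: 8e680002' "$LOG" | grep -E 'USER-(29|70)' > good_run_summary.txt
diff bad_run_summary.txt good_run_summary.txt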
So, to dive right into it, here is my analysis of the level 5 and level 7 traces:
-In the level 5 trace, the main difference between good-performing and bad-performing runs is the step doing the activity "Logical Request (before navigation)":
[2022-05-20T13:08:15.40-06:00] [OBIS] [TRACE:7] [] [] [ecid: 005rrOqN1n21zWX_Pth8id0002QZ00088i,0] [sik: bootstrap] [tid: 944d6700] [messageid: USER-2] [requestid: 16750002] [sessionid: 16750000] [username: REDACTED] -------------------- Logical Request (before navigation): [[
The level 7 trace provides more detail on what comes next after this step and gives me a better view of what is slowing down. We see a number of selects on dimensions doing lookups for the query. Instead of taking almost no time, each of these small "List of attributes and their LTS sources" steps starts taking closer to 1 second to complete:
[2022-05-20T13:08:22.10-06:00] [OBIS] [TRACE:7] [] [] [ecid: 005rrOqPytl1zWX_Pth8id0002QZ00088o,0] [sik: bootstrap] [tid: 944d6700] [messageid: USER-53] [requestid: 16750002] [sessionid: 16750000] [username: REDACTED] -------------------- List of attributes and their LTS sources: [[
Select: Dim - Date Fiscal Calendar.Fiscal Period Wid: Logical table sources (Priority=0, SystemGenerated=False): Dim - Date Fiscal Calendar.Dim_W_MCAL_PERIOD_D_Fiscal_Period
So the level 7 trace gives me a good idea of specifically what is taking longer, and it appears to be the navigation/compilation of the logical request against the RPD.
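To quantify exactly which steps the time accumulates in, I compute the gap between consecutive trace lines for a single request id. A rough sketch, assuming GNU date and the same log path as above:

# Print the elapsed milliseconds between consecutive trace lines for one request id,
# so the slow "List of attributes and their LTS sources" steps stand out.
grep 'requestid: 16750002' "$LOG" | while read -r line; do
  ts=$(echo "$line" | grep -oE '^\[[0-9T:.+-]+\]' | tr -d '[]')
  msg=$(echo "$line" | grep -oE 'USER-[0-9]+' | head -1)
  now=$(date -d "$ts" +%s%3N)
  [ -n "$prev" ] && echo "delta_ms=$((now - prev)) before $msg"
  prev=$now
done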
At the system level, if things are slowing down under volume, I would expect something like exhaustion of a thread pool, or perhaps that we are hitting some memory limit. The diagnostic dumps don't really yield any useful information as to what resource is being constrained, and it seems to be too specific within the NQ server to get hits on.
Generally, the hosts the cluster runs on remain below 10% CPU usage, so at face value it appears the NQ server starts to run out of something before the actual CPU resources do.
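To try to rule resource limits in or out, I snapshot the basics for the nqsserver process from /proc on a schedule. Again just a sketch; the interval and output file are my own:

# Snapshot thread count, resident memory, and open file descriptors for nqsserver every
# minute, to see whether anything climbs in step with the slowdown.
PID=$(pgrep -f nqsserver | head -1)
while true; do
  printf '%s threads=%s rss_kb=%s fds=%s\n' \
    "$(date '+%H:%M:%S')" \
    "$(awk '/^Threads:/ {print $2}' /proc/$PID/status)" \
    "$(awk '/^VmRSS:/ {print $2}' /proc/$PID/status)" \
    "$(ls /proc/$PID/fd | wc -l)" >> nqsserver_resources.log
  sleep 60
done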
I've reviewed pretty much all of the settings in NQSConfig.INI and have tuned and tweaked everything I can think of to alleviate the issue, but cannot come to a root cause of what is being exhausted.
I've also been monitoring the nqsserver process for some time and have been doing regular pstack dumps to try to get a better idea from the OS level of what nqsserver is doing, but the thread dumps seem to be fairly consistent, with about 300 threads and about 70 threads sitting in lll_lock.
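To make those dumps easier to compare over time, I reduce each one to totals plus a histogram of the top frame per thread. Another rough sketch (the frame parsing assumes the usual gdb-style pstack output):

# Take a pstack dump of nqsserver and reduce it to totals plus a histogram of each thread's
# top stack frame, so consecutive dumps are easy to diff.
PID=$(pgrep -f nqsserver | head -1)
DUMP=pstack_$(date +%Y%m%d_%H%M%S).txt
pstack "$PID" > "$DUMP"
echo "threads total:    $(grep -c '^Thread ' "$DUMP")"
echo "lll_lock frames:  $(grep -c 'lll_lock' "$DUMP")"
# Frame #0 of each thread, grouped and counted.
grep '^#0' "$DUMP" | awk '{print $4}' | sort | uniq -c | sort -rn | head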
Recycling the OBIS nodes definitely resolves the problem, but I'm just not clear on what the restart does that then allows the nqsserver to process so much faster when freshly started.