I am looking into way to improve the performance of dgraph running on Amazon EC2 Linux. We are using v6.1.3.
What we are observing is the dgraph processing the query fast between 20ms to 800ms when we put a load of 50users with 2000 requests in 60seconds. However dgraph takes time to write the response to the socket.
Again, it is only for some queries. In the dgraph reqlog I can see few entries that 80ms to process the request but 1800ms to finish prcossing the request.
I am running 4 threads with single dgraph.
Please anyone suggest me a way to improve the dgraph or server socket/network performance.
Edited by: 924798 on 03-Sep-2012 21:33
You mention that the latency is higher only for some queries - try and identify and isolate those queries if you can. Some of the performance testing/monitoring tools and the engine stats page should help with that.
There are a number of possibilities, but one of the common causes of a large engine-time/round-trip-time differential is large response payloads. This can be large individual records, or many refinement values being returned (check, have you exposed too many or all refinements?). Look for outlier large records, and if it's refinement values, make sure you are only requesting the ones you need, and consider implementing dynamic refinement.
Another possibility is slow query response reading from a busy appserver, so isolate that in your testing also by loading the dgraph directly.
Thanks for the reply, and I will look into the way to reduce response size. Also when the dgraph server is not in use or in less load, sometime the simple health check through api is taking time to get the response back. When I looked into the dgraph request log, I found dgraph is taking time to send the response
1346894138039 *.*.*.* - 204057 321 29041.27 0.03 200 - 0 1 /admin?op=ping - X-Forwarded-For:
Also there is a load balancer between application and two dgraph servers.
In that *29041.27* *0.03* are response time,
*29041.27* - is total request time
*0.03* - is total processing time
Processing time is faster however for some reason total request time is way higher for a health check and that during when there is not much load on the server.
I see there's a X-Forwarded-For header in the ping request. Is the ping request being made from the load-balancer, or through the load-balancer? Are you running any forward or reverse-proxy between the appserver and the MDEX? What type of load-balancer is it? Has the load-balancer and network components between the appserver and MDEX been analysed/tested for capacity, errors etc.?
You can also check the Hit rate column in Main Cache section in stats page, whether MDEX using cache for serving the result or is it making hit to dgraph. You can tune it for better performance using --cmem* flag of dgraph.
Also you can use dgraph analysis tool like cheetah script and check if queries are spending too much time in queue. Better use production logs for this analysis. If queries are spending too much time in queue then you can probably tune the number of dgraph threads with respect to hardware configuration available.
As suggested by brett, its always better to tune the query by removing unnecessary part or you can decrease the response size by using setSelection method available in ENEQuery.
That time stamp is independent of any "network". Any network slow down(s) would be added on top of that time stamp. Or I am not understanding some thing basic about this topology.
And ... for a ping taking 29 seconds I would just open an issue with Oracle directly. IMO (No idea how to do that.)
Edited by: EndecaJoe on Sep 7, 2012 6:03 AM
The version we are using here is 6.1.3.
The documentation says,
Total Request Time - The request lifetime, in milliseconds. Equal to the total amount of time between when the Dgraph reads the request from the network and finishes sending the result. May include queuing time, such as
time spent waiting for earlier requests to be completed.
Note: In previous releases, the request lifetime ended when the connection was closed. If connection close did not time out, this lifetime would include the time to transport the response to the
client, and the time for the client to read the response. Starting with 6.1.0, the request lifetime ends when the response has been successfully delivered to the socket layer.
We are doing healthcheck from two places,
1. By Load Balancer - which uses direct ping through url -> http://host:port/admin?op=ping
2. By Application using endeca api which goes through Load Balancer -> this request usually takes longer to get response.
The one send by load balancer is faster and generally no issues at all:
timestamp *.*.*.* - 69927 320 *0.39 0.05* 200 - 0 1 /admin?op=ping - Host
However one send through load balancer by api is generally takes time. during this weekend we left the server with no load and just left with healthcheck. Below are the entries in dgraph logs. maximum it takes 29s to write the response and most of the times it took 400ms to write respionse.
1347233221020 *.*.*.* - 180794 321 *456.87 0.05* 200 - 0 1 /admin?op=ping - X%2DForwarded%2DFor
1347233221263 *.*.*.* - 180799 0 0.18 0.00 0 - 0 0 - -
1347233221758 *.*.*.* - 180800 321 *0.47 0.06* 200 - 0 1 /admin?op=ping - Host
1347233231641 *.*.*.* - 180788 209 *30010.75 0.00* 408 - 0 0 - -
1347233246206 *.*.*.* - 180802 0 0.16 0.00 0 - 0 0 - -
1347233246850 *.*.*.* - 180803 321 *0.38 0.04* 200 - 0 1 /admin?op=ping - Host
1347233250380 *.*.*.* - 180798 321 *29359.01 0.05* 200 - 0 1 /admin?op=ping - X%2DForwarded%2DFor
We have done load test on the appserver generally 90% of the queries took less than a second to send response back. But some queries took more than 10seconds; when checked dgraph log, processing time is with in 100ms but dgraph took time to write the response. Also we have done eneperf testing on the dgraph, and 100% of the quries ran with in a second.
But with no load, for a healthcheck through load balancer using api is taking more time from 400ms to 29second.
Regarding load balancer, it is a stateless internal load balancer configured for round robin. Also I am seeing lots of 408 entries in the dgraph logs but with no clear information
like, 1347233231641 *.*.*.* - 180788 209 30010.75 0.00 408 - 0 0 - -
just a short comment, i think I reand somewhere in an official documentation that the appropriate algorithm for you LB is "Least Connections"
maybe you should try running Eneperf with a small extract of a recent reqlog file against the LB (vip) and compare your stats with running it against individual Dgraphs to see if you have behaviour issues with your LB