This discussion is archived
12 Replies Latest reply: Oct 5, 2012 8:50 AM by sabdelhalim RSS

Performance Issue

User387251 Newbie
Currently Being Moderated
Hi,

I am looking into way to improve the performance of dgraph running on Amazon EC2 Linux. We are using v6.1.3.
What we are observing is the dgraph processing the query fast between 20ms to 800ms when we put a load of 50users with 2000 requests in 60seconds. However dgraph takes time to write the response to the socket.
Again, it is only for some queries. In the dgraph reqlog I can see few entries that 80ms to process the request but 1800ms to finish prcossing the request.

I am running 4 threads with single dgraph.

Please anyone suggest me a way to improve the dgraph or server socket/network performance.

Thanks,
RP

Edited by: 924798 on 03-Sep-2012 21:33
  • 1. Re: Performance Issue
    User387251 Newbie
    Currently Being Moderated
    Hi,

    Please suggest any solution for this issue.

    Thanks,
    RP
  • 2. Re: Performance Issue
    brettr Journeyer
    Currently Being Moderated
    Hi,

    You mention that the latency is higher only for some queries - try and identify and isolate those queries if you can. Some of the performance testing/monitoring tools and the engine stats page should help with that.

    There are a number of possibilities, but one of the common causes of a large engine-time/round-trip-time differential is large response payloads. This can be large individual records, or many refinement values being returned (check, have you exposed too many or all refinements?). Look for outlier large records, and if it's refinement values, make sure you are only requesting the ones you need, and consider implementing dynamic refinement.

    Another possibility is slow query response reading from a busy appserver, so isolate that in your testing also by loading the dgraph directly.

    Best
    Brett
  • 3. Re: Performance Issue
    User387251 Newbie
    Currently Being Moderated
    Hi Brett,

    Thanks for the reply, and I will look into the way to reduce response size. Also when the dgraph server is not in use or in less load, sometime the simple health check through api is taking time to get the response back. When I looked into the dgraph request log, I found dgraph is taking time to send the response

    1346894138039 *.*.*.* - 204057 321 29041.27 0.03 200 - 0 1 /admin?op=ping - X-Forwarded-For:

    Also there is a load balancer between application and two dgraph servers.

    Thanks again,
    Regards,
    RP
  • 4. Re: Performance Issue
    953835 Newbie
    Currently Being Moderated
    Which of the numbers in that is the response time?
  • 5. Re: Performance Issue
    953835 Newbie
    Currently Being Moderated
    This has lots of information. Maybe something here will help narrow it down.
    http://<server>:<dgraph-port>/admin?op=stats
  • 6. Re: Performance Issue
    User387251 Newbie
    Currently Being Moderated
    Hi,


    In that *29041.27* *0.03* are response time,


    *29041.27* - is total request time

    *0.03* - is total processing time

    Processing time is faster however for some reason total request time is way higher for a health check and that during when there is not much load on the server.

    Thanks,
    RP
  • 7. Re: Performance Issue
    brettr Journeyer
    Currently Being Moderated
    I see there's a X-Forwarded-For header in the ping request. Is the ping request being made from the load-balancer, or through the load-balancer? Are you running any forward or reverse-proxy between the appserver and the MDEX? What type of load-balancer is it? Has the load-balancer and network components between the appserver and MDEX been analysed/tested for capacity, errors etc.?

    Best
    Brett
  • 8. Re: Performance Issue
    PravinChikhale Explorer
    Currently Being Moderated
    You can also check the Hit rate column in Main Cache section in stats page, whether MDEX using cache for serving the result or is it making hit to dgraph. You can tune it for better performance using --cmem* flag of dgraph.

    Also you can use dgraph analysis tool like cheetah script and check if queries are spending too much time in queue. Better use production logs for this analysis. If queries are spending too much time in queue then you can probably tune the number of dgraph threads with respect to hardware configuration available.

    As suggested by brett, its always better to tune the query by removing unnecessary part or you can decrease the response size by using setSelection method available in ENEQuery.
    http://docs.oracle.com/cd/E28910_01/PresentationAPI.622/apidoc/javadoc/navigation/com/endeca/navigation/ENEQuery.html#setSelection(com.endeca.navigation.FieldList)

    HTH

    Thanks,
    Pravin.
  • 9. Re: Performance Issue
    953835 Newbie
    Currently Being Moderated
    the dgraph request log
    That time stamp is independent of any "network". Any network slow down(s) would be added on top of that time stamp. Or I am not understanding some thing basic about this topology.
    And ... for a ping taking 29 seconds I would just open an issue with Oracle directly. IMO (No idea how to do that.)

    Edited by: EndecaJoe on Sep 7, 2012 6:03 AM
  • 10. Re: Performance Issue
    User387251 Newbie
    Currently Being Moderated
    The version we are using here is 6.1.3.

    The documentation says,

    Total Request Time - The request lifetime, in milliseconds. Equal to the total amount of time between when the Dgraph reads the request from the network and finishes sending the result. May include queuing time, such as
    time spent waiting for earlier requests to be completed.
    Note: In previous releases, the request lifetime ended when the connection was closed. If connection close did not time out, this lifetime would include the time to transport the response to the
    client, and the time for the client to read the response. Starting with 6.1.0, the request lifetime ends when the response has been successfully delivered to the socket layer.


    We are doing healthcheck from two places,
    1. By Load Balancer - which uses direct ping through url -> http://host:port/admin?op=ping
    2. By Application using endeca api which goes through Load Balancer -> this request usually takes longer to get response.

    The one send by load balancer is faster and generally no issues at all:

    timestamp *.*.*.* - 69927 320 *0.39 0.05* 200 - 0 1 /admin?op=ping - Host

    However one send through load balancer by api is generally takes time. during this weekend we left the server with no load and just left with healthcheck. Below are the entries in dgraph logs. maximum it takes 29s to write the response and most of the times it took 400ms to write respionse.

    1347233221020 *.*.*.* - 180794 321 *456.87 0.05* 200 - 0 1 /admin?op=ping - X%2DForwarded%2DFor
    1347233221263 *.*.*.* - 180799 0 0.18 0.00 0 - 0 0 - -
    1347233221758 *.*.*.* - 180800 321 *0.47 0.06* 200 - 0 1 /admin?op=ping - Host
    1347233231641 *.*.*.* - 180788 209 *30010.75 0.00* 408 - 0 0 - -
    1347233246206 *.*.*.* - 180802 0 0.16 0.00 0 - 0 0 - -
    1347233246850 *.*.*.* - 180803 321 *0.38 0.04* 200 - 0 1 /admin?op=ping - Host
    1347233250380 *.*.*.* - 180798 321 *29359.01 0.05* 200 - 0 1 /admin?op=ping - X%2DForwarded%2DFor

    Thanks,
    RP
  • 11. Re: Performance Issue
    User387251 Newbie
    Currently Being Moderated
    Hi Brett,

    We have done load test on the appserver generally 90% of the queries took less than a second to send response back. But some queries took more than 10seconds; when checked dgraph log, processing time is with in 100ms but dgraph took time to write the response. Also we have done eneperf testing on the dgraph, and 100% of the quries ran with in a second.

    But with no load, for a healthcheck through load balancer using api is taking more time from 400ms to 29second.

    Regarding load balancer, it is a stateless internal load balancer configured for round robin. Also I am seeing lots of 408 entries in the dgraph logs but with no clear information
    like, 1347233231641 *.*.*.* - 180788 209 30010.75 0.00 408 - 0 0 - -

    Thanks,
    RP
  • 12. Re: Performance Issue
    sabdelhalim Newbie
    Currently Being Moderated
    Hi,
    just a short comment, i think I reand somewhere in an official documentation that the appropriate algorithm for you LB is "Least Connections"

    maybe you should try running Eneperf with a small extract of a recent reqlog file against the LB (vip) and compare your stats with running it against individual Dgraphs to see if you have behaviour issues with your LB

    regards
    Saleh

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points