(This is a double post from the Endeca Experience Management forum)
I am looking for best practices on how to define an Endeca service outage and monitor the health of the system. I understand this depends on user requirements and may vary from customer to customer. Specifically, what criteria do you use to notify your engineers that there is a problem? We have our load balancers pinging the dgraphs on an interval. However, the ping operation is not sufficient for our use case. We are also experimenting with running a "low cost" query against the dgraphs on an interval and using query latency thresholds to determine an outage. I would like to hear from people in the field running large commercial web sites about their best practices for monitoring and alerting on system health.
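To make the latency-threshold idea concrete, here is a rough sketch of the kind of check we are experimenting with. Everything in it is an assumption for illustration: the names timed_probe and classify, the 0.5 s and 2.0 s thresholds, and the "three bad probes in a row" rule are hypothetical, and run_query stands in for whatever call issues the low cost query to a dgraph.

```python
import time
from typing import Callable, Optional, Sequence


def timed_probe(run_query: Callable[[], object]) -> Optional[float]:
    """Run one probe; return elapsed seconds, or None if the probe raised."""
    start = time.monotonic()
    try:
        run_query()  # hypothetical: issues the low cost query to one dgraph
    except Exception:
        return None
    return time.monotonic() - start


def classify(
    samples: Sequence[Optional[float]],
    warn_s: float = 0.5,    # assumed threshold, tune per site
    crit_s: float = 2.0,    # assumed threshold, tune per site
    max_failures: int = 3,  # assumed number of consecutive bad probes
) -> str:
    """Classify health from probe results, oldest first.

    A sample of None means the probe failed outright.
    """
    recent = list(samples)[-max_failures:]
    # Outage: the last max_failures probes all failed or were critically slow.
    if len(recent) == max_failures and all(
        s is None or s >= crit_s for s in recent
    ):
        return "OUTAGE"
    # Degraded: the latest probe failed or exceeded the warning threshold.
    if recent and (recent[-1] is None or recent[-1] >= warn_s):
        return "DEGRADED"
    return "OK"
```

Requiring several consecutive bad probes before declaring an outage is meant to avoid paging an engineer for a single slow query, at the cost of a slower alert.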
The performance metrics should help you analyse query behaviour and fine-tune the system.
Here are a few best practices:
1. Reduce the number of components per page
2. Avoid complex LQL queries
3. Keep the LQL threshold small
4. Display the minimum number of columns needed