I have a product for which I would like to create a dashboard to show its availability/uptime over time and display any outages.
Specifically, I am looking for the ability to:
- report historical information on service uptime
- provide details on any service outages
The product runs on a fleet of Linux servers and connects to a DB running on a separate instance. We also have some dedicated instances that run nightly batch jobs. The system relies on some external services to provide additional functionality for select customers. There is also a Redis cache that holds data for multiple customers.
We replicate the entire setup above (application servers, DB, job servers, Redis cache, etc.) into dedicated clusters for large customers. Small customers are placed on one of the shared clusters to keep costs low.
Currently we run health checks on the application servers only and publish the results on a simple HTML page. This is the go-to page for end users/customers and support teams.
Since the product is built from multiple systems/services, our current HTML page often says that the system is up and running fine while it is actually experiencing issues with some of its components or external services.
The current health check sends a simple HTTP request and looks for a 200 status code. It runs every minute, and we plot the results in a simple chart covering the last 30 days. We also show a list of outages, each with a timestamp and additional static information that is added manually.
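To make the current setup concrete, here is a minimal sketch of how an uptime figure and an outage list could be derived from the per-minute check results. It is an illustration only, not the actual implementation: `Sample`, `uptime_percent`, and `find_outages` are hypothetical names, and a status of 0 is assumed to mean the request got no response.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Sample:
    ts: datetime   # when the check ran
    status: int    # HTTP status code; 0 is assumed to mean "no response"


def uptime_percent(samples: List[Sample]) -> Optional[float]:
    """Share of checks that returned 200, or None if there is no data."""
    if not samples:
        return None
    ok = sum(1 for s in samples if s.status == 200)
    return 100.0 * ok / len(samples)


def find_outages(
    samples: List[Sample],
    interval: timedelta = timedelta(minutes=1),
) -> List[Tuple[datetime, datetime]]:
    """Group consecutive failed checks into (start, end) windows.

    The end of a window is the last failed check plus one check interval,
    since the service is only known to be back at the next good check.
    """
    windows: List[Tuple[datetime, datetime]] = []
    start = None
    last_fail = None
    for s in sorted(samples, key=lambda s: s.ts):
        if s.status != 200:
            if start is None:
                start = s.ts          # first failure opens a window
            last_fail = s.ts
        elif start is not None:
            windows.append((start, last_fail + interval))
            start = None              # a good check closes the window
    if start is not None:             # outage still ongoing at end of data
        windows.append((start, last_fail + interval))
    return windows
```

Feeding this the last 30 days of samples would give the numbers behind the chart and the timestamps for the outage list; the manually added outage details would still have to come from somewhere else.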
We would like to build a more robust solution that monitors much more than the HTTP port and gives us more detail: which part of the system is having issues, how those issues are impacting the system, and which customers are impacted.
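One possible shape for such a check, sketched under assumptions: each cluster lists its components, each component has a check function that returns True when healthy, and a component may name the subset of customers that depend on it. `Component`, `Cluster`, `evaluate`, and the cluster and customer names are all hypothetical; real checks (a DB query, a Redis ping, a call to an external service) would replace the lambdas.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional, Set


class Status(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


@dataclass
class Component:
    name: str
    check: Callable[[], bool]           # returns True when healthy
    critical: bool = True               # a critical failure marks the cluster DOWN
    affects: Optional[Set[str]] = None  # None means every customer on the cluster


@dataclass
class Cluster:
    name: str
    customers: Set[str]
    components: List[Component]


def evaluate(cluster: Cluster) -> Dict[str, object]:
    """Run every component check and summarise status and impacted customers."""
    status = Status.UP
    failed: List[str] = []
    impacted: Set[str] = set()
    for comp in cluster.components:
        try:
            healthy = comp.check()
        except Exception:
            healthy = False             # a check that raises counts as a failure
        if healthy:
            continue
        failed.append(comp.name)
        if comp.affects is None:
            impacted |= cluster.customers
        else:
            impacted |= comp.affects & cluster.customers
        if comp.critical:
            status = Status.DOWN
        elif status is Status.UP:
            status = Status.DEGRADED
    return {
        "cluster": cluster.name,
        "status": status.value,
        "failed": failed,
        "impacted": sorted(impacted),
    }


if __name__ == "__main__":
    shared = Cluster(
        name="shared-1",
        customers={"acme", "globex", "initech"},
        components=[
            Component("app", lambda: True),
            Component("db", lambda: True),
            Component("redis", lambda: True, critical=False),
            # Hypothetical external service used only by one customer.
            Component("external-api", lambda: False, critical=False, affects={"acme"}),
        ],
    )
    print(evaluate(shared))
```

Running one `evaluate` per cluster (dedicated and shared) each minute and storing the results would give both the "which component" and the "which customers" views for the status page.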
I would appreciate any guidance or help. We prefer to build the solution using open source tools since we don't have much budget. The goal is to improve things for my team members, who are already overloaded.
I'm not sure if this will be overkill or not for your setup, given that I don't know your product, but have a look at the ELK Stack and see if you can use some components or at least some ideas from there: