Streamlining Infrastructure Monitoring and Metrics in IT-DB-IMS [report]

Charles Callum Newey, Giacomo Tenaglia, Artur Wiecek
2015 Zenodo  
Project Specification

Basic Systems Monitoring. This is currently done with a custom CERN-made collector, which sends data in the form of "notifications" via Apache Flume to HDFS for long-term storage. The notifications are also sent to Elasticsearch and displayed with Kibana (à la Splunk). Due to limitations of the architecture, system data is collected only every 5 minutes, which is not ideal. The idea is to implement a solution that allows more fine-grained sampling of system metrics. Possible ideas are OpenTSDB (to leverage the existing Hadoop infrastructure) or prometheus.io, which should be simpler to set up but only scales out by sharding. OpenTSDB initially seems like the more promising solution, so the task is to investigate the various collection and display alternatives.

Logs Management and Centralisation. This is currently done only for syslog, with Apache Flume shipping to HDFS (where logs are kept "forever") and to Elasticsearch/Kibana, with a 1-month retention time. There are two issues: flexibility of the collection process, and authorisation. On the flexibility side, collected messages need to be split into different fields before being stored in HDFS/Elasticsearch, in order to ease the data mining process. Logstash and grok are potentially promising solutions. On the authorisation side, the idea is to specifically target the WebLogic installations in order to expose application logs to clients in a convenient way. Because we host very different applications with different confidentiality levels (amazing what can be found in some application logs!), we need to put an authorisation layer on top of HDFS and Elasticsearch (Kibana being just JavaScript that queries Elasticsearch directly). For this there are some methods that could be implemented on HDFS, and for Elasticsearch there is a FOSS plugin to be checked. This part would probably involve setting up an Elasticsearch cluster first.

Abstract. There are a number of problems with the current monitoring infrastructure in IT-DB which currently make it difficult to diagnose certain [...]
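The field-splitting step mentioned under Logs Management and Centralisation can be illustrated with a minimal sketch. It is not the report's implementation and does not use Logstash or grok; it only mimics, in plain Python, what a grok-style pattern does to a syslog line. The pattern, the field names (`timestamp`, `host`, `program`, `pid`, `message`) and the function `split_syslog_line` are all assumptions chosen for illustration, and the pattern covers only a classic BSD-style syslog layout.

```python
import re

# Hypothetical pattern for a classic BSD-style syslog line
# ("Mon DD HH:MM:SS host program[pid]: message"). Real logs may
# use other layouts and would need their own patterns.
SYSLOG_PATTERN = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^\s\[:]+)"
    r"(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)


def split_syslog_line(line):
    """Split one syslog line into named fields (illustrative only)."""
    match = SYSLOG_PATTERN.match(line)
    if match is None:
        # Keep unparseable lines instead of dropping them, and flag them.
        return {"message": line, "parse_error": True}
    fields = match.groupdict()
    if fields["pid"] is not None:
        fields["pid"] = int(fields["pid"])
    return fields


if __name__ == "__main__":
    sample = "Aug  3 14:02:11 dbhost01 sshd[4321]: Accepted publickey for oracle"
    print(split_syslog_line(sample))
```

Storing the resulting dictionary rather than the raw line is what would make per-field queries (for example by host or program) possible downstream; lines that do not match are kept and flagged so that nothing is silently lost.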
doi:10.5281/zenodo.31862 fatcat:zek2ncfcmzfrzgonevxpbck5ri