DC 2015 - Program


Ready for more DevOpsDays?

DevOpsDays DC will return June 8-9, 2016.

Monitoring Best Practices

Facilitator: Steve Bannon

Scribe: Joanne Garlow

Monitoring should help towards the bottom line - counter - the money you make in 6 months might depend on that you are doing today

Monitoring by working with the developers to add custom parameters into the APIs and using log messages in case of issues to trace back problems to the root cause

Top to bottom view of the whole system is critical

Reviews of key monitoring after deployments

Tools - New Relic, cacti, zabbix, mysql monitor, graphite, influxDb, logstash, splunk, solarwinds, PagerDuty (highly recommended), nagios - people have a love/hate relationship icinga - fork of nagios

Tracing issues requires thoughtful logging in advance

Use logstash to get all different logs together roughly correlated in time

Key is getting all the logs in the same place Big challenge is getting all the groups to feed all their data into the same place *Logs can be helpful when you are missing some key monitoring - but after you hit a problem, start monitoring

Rolling up monitoring and metrics to get something that executives care about

Cucumber - write your behavior specifications at a high business level and then instrument monitoring around these requirements in a systematic way NewRelic was mentioned as another way to do this.
Nontechnical person can write cucumber scripts

Monitor everything you care about butonly alert when it is actionable - don't page someone for something that can wait Don't let there be so much monitoring that it makes noise that people are ignoring Get the right level of notifications on issues - if it can wait - put it in an email Use something like VictorOps to mange the "actionable alerts" and make sure the right people get the alerts... to avoid alert fatigue and ignored pages.

Transparency in what is getting monitored

You have to really think about the monitoring tools because you get pretty locked into whatever you choose for a while

Need to throttle logs wisely to prevent getting too many of the same error without hiding the extent of the error

Platinum & Venue Sponsor


Platinum Sponsors

Excella Consulting Sumo Logic Netuitive Ansible Chef CustomInk Delphix Red Hat Elastic VictorOps Fugue Sonatype FireEye

Gold Sponsors

InfoZen Comcast Circonus Opus Group, LLC BlackMesh PagerDuty govready

Silver Sponsors

Puppet Labs