DC 2015 - Program


Ready for more DevOpsDays?

DevOpsDays DC will return June 8-9, 2016.

Instrumentation of Alerting & Trend Data

Facilitator: Kevin Ege

  • Some use Nagios but limited metrics to collect
  • Scount is a tool that was recommended
  • Big concern: How do you reduce False Positives while still catching relevant issues
  • We encourage dev team to put in metrics during dev time and identify failure conditions
  • Example: Web site activity falling below certain levels but if happens on a Sunday and get false postiive
  • There are types of issues that are difficult to catch that are not about latency but the content is not right which is hard to detect
  • You have to uncouple development and code from architecture and service uptime
  • Developers won't always know what the impact of a node failure would be on a service
  • Alerting burden falls on ops but in concert with unit-tests that developers run during testing
  • Example: front developers won't relate to IOPS alerts or database connection issues
  • There are differnet domains of monitoring: Page load time vs. database monitoring
  • If there are failures in the back-end but is not affecting user experience issues then it should not be escalated
  • You can alert a DBA 3 days in advance but only the DBA who would know what to do with it before an outage happens
  • In an ephemeral infrastructure, you can restart the whole app and it may help certain issues but not all
  • Analytics software like Skyline doesn't produce good results
  • Advanced statisics, machine learning, correlation is the intent
  • You can blow away nodes based on statistics with the right architecture
  • Some tried to write own statistics but hard to get good results
  • Some have a lot of provisioned IOPS and reserved instances to help with performance issues
  • We trend data and as Subject Matter Experts to review the graphs and pick up on hints on spikes
  • We go back 3 years and use RRD
  • We keep 1 year full granulatiry and then aggregated data after that
  • Not retaining a tremendous amount of granular data
  • Alerting on COGS factors with AWS instances
  • Avoid baked in static thresholds
  • Uncouple data instrumentation from alert logic

Platinum & Venue Sponsor


Platinum Sponsors

Excella Consulting Sumo Logic Netuitive Ansible Chef CustomInk Delphix Red Hat Elastic VictorOps Fugue Sonatype FireEye

Gold Sponsors

InfoZen Comcast Circonus Opus Group, LLC BlackMesh PagerDuty govready

Silver Sponsors

Puppet Labs