“Incremental change, released often” is the new mantra. A team of one may handle the operability implementation for an application. Workarounds are done to get things out the door “on time” to meet agile requirements. This person sits at the center of changes, ensuring that monitoring, configuration, and service expectations are coordinated. As the velocity of change increases, this single point of failure leads to slowdowns. The operations engineer is viewed as the “hero” for working late nights or weekends to keep the service running as needed. This false hero role is not sustainable and creates a rigid environment of change aversion.
In this talk, I will describe paths for supporting complex project deployment and configuration, from manual heroics to minimal intervention. I will describe specific tools, but the concepts will be applicable regardless of environment. This is not a one-size-fits-all solution, but the guidelines presented should help shape a direction toward success.
DevOps asks us to break down the walls between our teams in order to collaborate better. It asks us to embrace tools that provide better transparency into what our systems are doing. It asks us to figure out how to learn about problems during development rather than in production. More than anything, I believe DevOps asks us to solve the following:
How do we improve the feedback loops in our organizations?
From my experience introducing DevOps methodologies both to large enterprises shipping on-prem products and to teams building SaaS applications, I've found that figuring out how to improve your teams' feedback loops is a common problem with high-impact results when solved.
In this talk, I will present my ideas about why long feedback loops are a problem and why it is so important to improve them in your organization. I will then give examples of where the lack of feedback has caused problems for my teams, from code changes that broke QA to emergency product demos, and explain how we fixed these issues to make our teams perform better.
As an engineering team, you have a solid delivery pipeline. Checking in code triggers the CI build, which runs unit tests, uploads to an artifact repo, runs functional tests, and deploys to UAT/Stage; finally, after signoff, you deploy to production.
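A pipeline like the one above can be sketched as a simple stage-by-stage orchestration that stops at the first failure. This is an illustrative sketch only; the stage names and `echo` placeholder commands are hypothetical, and a real pipeline would run inside a CI server rather than a script:

```python
import subprocess

# Hypothetical stages; each command here is a placeholder for the real
# CI job (test runner, artifact upload, deploy tooling, etc.).
STAGES = [
    ("unit tests", ["echo", "running unit tests"]),
    ("upload artifact", ["echo", "uploading to artifact repo"]),
    ("functional tests", ["echo", "running functional tests"]),
    ("deploy to UAT/Stage", ["echo", "deploying to stage"]),
]

def run_pipeline(stages):
    """Run each stage in order; return (completed stages, failed stage or None)."""
    completed = []
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return completed, name  # stop the pipeline at the first failure
        completed.append(name)
    return completed, None
```

The key property is the hard stop on failure: a broken unit-test stage never reaches the deploy stages.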
You have come a long way, though now you have a new problem: cloud management.
There are use cases that require a production-level environment (RAM, CPU, DB, HDD, etc.) for application development and validation. Maybe the application doesn’t run well locally on engineers’ laptops. Maybe it needs “big data”. Maybe it’s a sales demo, or performance testing, or maybe you haven’t fully moved to Vagrant/Docker for this app.
In true DevOps fashion, any engineer should be able to (re)provision a service in the cloud on demand. However, you have neither unlimited funds nor unlimited time, and thus want to use this metered service efficiently. The engineering team needs to know who is using existing services and what state each service is in, and needs the ability to provision new services when necessary.
We will cover our journey in addressing this challenge of self-service cloud management and deployment. At the beginning of 2013 we surveyed the tools that were available. Over the next year, we extended an existing open source tool (knife-ec2) for better AWS provisioning and created a new tool, Atlas, for self-service cloud management and deployment.
As more organizations embrace DevOps culture, each faces its own unique hurdles. One of the major hurdles we have had to overcome is verifying our infrastructure prior to rolling out each of the pieces. The two major pieces of this are idempotent provisioning and the ability to monitor all necessary aspects of the environment. For these we use Chef and Sensu respectively, giving us reasonable certainty about what state our systems are in and how well they are running. There will always be cases where we cannot run these tools, or where the tools cannot provide all of the necessary information, leaving us in a grey area with respect to how much confidence we have in the system.
To work around these limitations, other organizations may run scripts via cron or other homegrown solutions to gain that confidence; at Yieldbot, we have openly embraced ServerSpec as our primary solution.
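ServerSpec itself expresses these checks as Ruby/RSpec examples (e.g. `describe file(...)`, `describe port(...)`). Purely as an illustration of the kinds of assertions involved, the same checks can be sketched in Python; the paths and ports below are hypothetical, not Yieldbot's actual suite:

```python
import os
import socket

def file_exists(path):
    """Mirror of ServerSpec's `describe file(path) { it { should exist } }`."""
    return os.path.exists(path)

def port_listening(host, port, timeout=1.0):
    """Mirror of ServerSpec's `describe port(n) { it { should be_listening } }`.

    Tries a TCP connection; anything listening will accept, anything
    closed will refuse (or time out).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Checks like these run against the live host after provisioning, turning "we think Chef converged" into an explicit pass/fail signal.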
Use of ServerSpec has allowed us to easily achieve the following:
The increasing popularity of DevOps over the past few years signifies an organic transition in the IT industry. A remarkably similar transition has been slowly unfolding across multiple industries and fields, from academia to finance and even policy. We have begun to realize the paramount importance of effective interdisciplinary collaboration as the only solution to the immense complexity of the next generation of challenges. This requires a highly systemic approach that transcends traditional organizational barriers. Progressive, innovative, and brilliant products require de-siloization, which creates conflict. The word conflict may carry cultural or linguistic connotations of violence and destruction; however, once we harness its energy and steer it in a productive direction, we can maximize our collective efficiency. DevOps is all about harnessing the power of conflict. This talk will focus on the significance of effective conflict resolution for DevOps, drawing on basic concepts of social psychology and systems thinking. It will begin by addressing the role of each individual member, then transition to intra- and inter-group dynamics by introducing the audience to notions of cooperation and competition, the value of intellectual opposition, trust, and positive interdependence.
The last decade belonged to virtual machines, the next one belongs to containers.
Virtualization led to an explosion in the number of machines in our infrastructure, and we were all caught off guard. It turns out those shell scripts did not scale after all. Luckily for us, configuration management swooped in to save the day.
But a new challenge is on the horizon. We are moving away from node-based infrastructures where hostnames are chosen with care. Gone are the days of pinning one service to a specific host and wearing a pager in case that host goes down. Containers enable a service to run on any host at any time. Our current set of tools is starting to show cracks because it was not designed for this level of application portability. It’s time to get ahead of the curve and take a look at new ways to deploy and manage applications at scale.
Introducing CoreOS. CoreOS is a new Linux distribution designed specifically for application containers and for running them at scale. This talk will examine all the major components of CoreOS, including etcd, fleet, Docker, and systemd, and show how these components work together to solve the problems of today and tomorrow.
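To give a flavor of how these components fit together: fleet schedules standard systemd units across the cluster, with an optional `[X-Fleet]` section controlling placement. The unit below is an illustrative sketch (the service name and image are hypothetical, not taken from the talk):

```
[Unit]
Description=Example web container
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=-/usr/bin/docker kill web
ExecStartPre=-/usr/bin/docker rm web
ExecStart=/usr/bin/docker run --name web -p 80:80 nginx
ExecStop=/usr/bin/docker stop web

[X-Fleet]
Conflicts=web@*.service
```

Because the unit names no particular host, fleet is free to (re)schedule it anywhere in the cluster, which is exactly the portability that node-pinned tooling struggles with.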
In the course of their day-to-day work, our development team actively relies on our metrics platform to confidently ship code to production and debug problems. They measure and correlate behavior between services on live production workloads, use real-time data to reason and hypothesize about production problems, and add or modify metrics and instrumentation in production to prove out their assumptions. Our own success in utilizing the metrics stream from production to close our engineering feedback loop has convinced us that this practice, which we describe as Metrics-Driven Development (MDD), is a requirement for building web-scale systems. It is a discipline that should be practiced by development teams alongside other development paradigms like Test-Driven Development (TDD) and Behavior-Driven Development (BDD).
Our talk will recount an episode where we employed MDD to diagnose an actual problem encountered in our production system running at scale. The audience will follow as the developer initially identified an anomaly in a production KPI metric, developed a hypothesis as to the cause of the anomaly, added instrumentation to the code in question and finally confirmed the original hypothesis through observation of real-time metrics. Along the way we’ll include references to specific tools and best practices that developers can adopt in their own MDD efforts. We’ll also demonstrate that MDD does not replace traditional debugging approaches like request logging or code profiling, but can often help narrow the focus of those efforts, which can be expensive or difficult to perform in web-scale systems.
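The "add instrumentation" step in a workflow like this commonly means emitting counters and timers to a StatsD daemon over UDP. A minimal sketch of the StatsD wire format, `<bucket>:<value>|<type>[|@<rate>]` (the bucket names here are hypothetical):

```python
import socket

def statsd_line(bucket, value, metric_type, sample_rate=None):
    """Format one metric in the StatsD wire format.

    metric_type is 'c' (counter), 'ms' (timer), or 'g' (gauge);
    an optional sample rate is appended as '|@<rate>'.
    """
    line = f"{bucket}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line

def send_metric(sock, addr, line):
    """StatsD metrics travel over UDP, so instrumentation is fire-and-forget."""
    sock.sendto(line.encode("ascii"), addr)

# Counting requests and timing a suspect code path, sampled at 10%:
counter = statsd_line("checkout.requests", 1, "c")
timer = statsd_line("checkout.latency_ms", 320, "ms", sample_rate=0.1)
```

The fire-and-forget UDP transport is what makes it cheap enough to leave this instrumentation running on live production workloads.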
This talk is a synthesis of cultural transformation, concrete engineering techniques, systems monitoring, scientific observation, and post-mortem analysis. It will prove intellectually gratifying and valuable to anyone writing and shipping code to production systems, even those already following an MDD model. They’ll learn what requirements a metrics platform must meet to support MDD, how to add lightweight instrumentation to code, and how to isolate problems using metrics derived from that instrumentation. The audience will also see how MDD can be used alongside traditional production debugging practices, and will come away with an understanding of how to ship better software through the use of MDD.
Graphite and StatsD are indispensable components of the modern DevOps stack. Companies such as Etsy have demonstrated that instrumenting your business and becoming a data-driven organization can improve the lives of your teams and help improve your products and your customers' experience.
Unfortunately running Graphite at scale is non-trivial. Acquia has matured over the years in its internal usage of Graphite and has learned many lessons along the way.
Come learn how we have scaled Graphite using Cassandra to store millions of data points, all while giving back to open source.
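For context on what any Graphite backend must ingest: Graphite's plaintext protocol is one `path value timestamp` line per data point, conventionally sent to TCP port 2003. A minimal formatter sketch (the metric path below is hypothetical):

```python
import time

def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol.

    The format is '<metric.path> <value> <unix_timestamp>\n'; if no
    timestamp is given, the current time is used.
    """
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"
```

It is precisely the firehose of these tiny lines, millions of paths at per-second resolution, that makes the storage layer the scaling bottleneck and motivates swapping Graphite's default whisper files for a backend like Cassandra.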