A few months ago, we wrote about how the first step to implementing Site Reliability Engineering (SRE) in an organization is getting leadership on board. So, let’s assume that you’ve gone ahead and done that. Now what? What are some concrete steps you can take to get the SRE ball rolling? In this blog post, we’ll take a look at what you, as an IT leader, can do to fast-track SRE within your team.
Step 1: Start small and iterate
“Rome wasn’t built in a day,” the saying goes, but you do need to start somewhere. When it comes to implementing SRE principles, the approach that I (and my team) found to be the most effective is to start with a proof of concept, learn from our mistakes, and iterate!
Start by identifying a relevant application and/or team
Many factors go into choosing a specific team or application for your SRE proof of concept. Most of the time, though, this is a strategic decision for the organization, which is outside the scope of this article. Possible starting points include a team that is shifting from traditional operations or DevOps to SRE, or a business-critical product whose reliability needs to improve. Whatever the reason, it’s crucial to select an application that is:
Critical to the business. Your customers should care deeply about its uptime and reliability.
Currently in development. Pick an application in which the business is actively investing resources.
Instrumented. Ideally, the application already provides data and metrics about its behaviour.
Conversely, stay away from third-party proprietary software. If your organization didn’t build the application, it’s not a good candidate for SRE! You need to be able to make strategic decisions about the application, and engineering changes to it, as needed.
Pro tip: In general, if you have workloads both on-premises and in the cloud, try to start with the cloud-based app. If your engineers come from a traditional operations environment, it will be easier to shift their thinking away from ‘bare metal’ and infrastructure metrics with a cloud-based app. Managed infrastructure turns practitioners into users and forces them to consume it the way developers do (through APIs, infrastructure as code, etc.).
Remember: Set realistic goals. Unrealistic expectations early on can discourage your team and undermine the entire initiative.
Step 2: Empower your teams
Implementing SRE principles requires fostering a learning culture. In that regard, enabling your teams means both training them, so that they have the knowledge they need, and empowering them.
Building a training program is a topic in and of itself, but it’s important to think about an enablement strategy early on. Especially in large organizations, you need to address internal upskilling, hiring and scaling the team, onboarding, and building a learning community.
Your enablement strategy should also accommodate employees at different levels and in different functions. For example, senior leadership’s training will look very different from practitioners’ training. Leaders need enough education to buy in and to make informed organizational decisions. To drive change across the entire organization, leaders might also need additional training on cultural concepts and practices.