Why calling everyone an SRE is not going to stick
We are at the beginning of 2020, and there is an emerging trend to bring clarity to the DevOps world by squashing the DevOps Engineer position and calling everyone who implements it an SRE (Site Reliability Engineer) instead. The intent is definitely good, as there is just too much chaos associated with the word DevOps, but I strongly believe that calling everyone an SRE is just not going to stick. The rest of this story is my opinion and the reasoning behind it.
Understanding the origin of DevOps “Engineering”
No matter the differences, DevOps Engineer and Site Reliability Engineer have one word in common: "Engineer". I have a degree in Computer Science and Engineering, yet I had never tried to understand the true meaning of Engineering, at least until today.
Today, I set out on a quest to find the true meaning of the word Engineering, to understand what it means to be an engineer. I started with the omniscient Wikipedia, according to which Engineering is about using scientific principles and methods to build systems, complex or simple alike. Now let's go deeper and understand what a "scientific method" is. Again, as per Wikipedia, "scientific method is an empirical method of acquiring knowledge that has characterised the development of science…"
The word empirical originated from the Greek word empeiría, meaning experience. So empirical learning is learning based on what you experience. Using the scientific method involves roughly the following steps:
Observe: observe and analyse how things work
Experiment: try things out, and observe what works, how it works, and what does not
Hypothesise: build your hypotheses from what you observe, using induction rather than deduction. The difference is this: with induction, the premises supply evidence for the conclusion, so the conclusion is probable, whereas the conclusion of a deductive argument is certain.
Apply scepticism: we all have our biases, so keep applying scepticism to your hypotheses to eliminate biases and errors from them
Repeat: keep iterating this process to further refine and course correct as the dynamics change
The more I read about all this, the more I understand why DevOps evolved the way it did. In 2009, when Patrick Debois started this as a conversation by organising a devopsdays event in Ghent, Belgium, it was just meant to talk about how to bring agility to infrastructure and break the barriers between developers and operators. This would help the organisation move faster and deploy more swiftly, a goal of the product development team, while not compromising the reliability and availability of the application, a seemingly contrasting goal of the operations team. Even after the event, people stuck with that conversation, creating the hashtag #devops for the sake of brevity, and it turned into an all-encompassing major movement in the software world.
I am convinced that the true goal of DevOps Engineering is to use the scientific method described above to find a path to a reasonable balance between agility and reliability. It's important to deploy fast, to build and push features that matter to the customers, which enhances the experience offered to the end user and keeps you ahead of, or at least on par with, the competition. At the same time, it's also true that unless your systems are reliable, all those new features that you are pushing out hardly matter. Site Reliability Engineering takes a scientific approach to achieve this by observing, experimenting, hypothesising, and refining through an iterative process.
The SRE Use Case of Dodomart
Let's look at an example. Ram manages an SRE team supporting a large-scale web infrastructure for Dodomart, the leading ecommerce platform in India. Dodomart announces the biggest sale of the year, the Diwali Sale, for the first time ever. As a result of a successful, large-scale marketing campaign, on the very first day of the sale there are lakhs (1 lakh = 100k) of requests coming in per second, slowing the systems down to the point that they are almost unusable. They even have a few minutes of downtime. Since the SRE team is constantly monitoring, they observe the spike in traffic as well as the latency going up, and then manually and continuously scale the systems up by adding the additional capacity needed. They are the heroes of the day, as they manage to keep the system running throughout, but this is a typically reactive approach.
So after this incident, Ram brings his SRE team together and, without blaming anyone for the high latency and downtime, they collectively observe and analyse what happened, what went wrong, and what they could do to avoid it in future. And it does not stop there. Based on their learnings, in the weeks to come, they build systems to automatically scale their infrastructure out and in based on traffic, use custom metrics to trigger the scaling, tweak and optimise the configuration of the servers, introduce a hybrid cloud deployment tool, and even work with the development teams to incorporate timeouts and retries in the application code. They implement the system and refine it multiple times, changing their initial premises and course correcting along the way. All of this work results in the next major sale announced by Dodomart going smoother than ever, with no hiccups in the infrastructure or its ability to scale.
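To make the "scale out / scale in based on traffic" idea a little more concrete, here is a minimal sketch in Python of the kind of rule such a system might start from. It is only an illustration under my own assumptions: the function name, the per-instance capacity of 500 requests per second, and the instance limits are all hypothetical, and this is not what Dodomart (a fictional company) or any particular autoscaler actually runs.

```python
import math


def desired_instances(current_rps, target_rps_per_instance,
                      min_instances=2, max_instances=1000):
    """Return how many instances are needed to serve the current traffic.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    if target_rps_per_instance <= 0:
        raise ValueError("target_rps_per_instance must be positive")
    # Proportional rule: enough instances that each one stays at or below
    # the target load, rounded up so we never under-provision.
    needed = math.ceil(current_rps / target_rps_per_instance)
    # Clamp to a floor (for redundancy) and a ceiling (to cap cost).
    return max(min_instances, min(max_instances, needed))


# Made-up numbers: a quiet night, a normal day, and the first day of the sale.
print(desired_instances(current_rps=100, target_rps_per_instance=500))      # 2
print(desired_instances(current_rps=4_000, target_rps_per_instance=500))    # 8
print(desired_instances(current_rps=300_000, target_rps_per_instance=500))  # 600
```

The point is not the formula itself but that the decision the team made by hand on the day of the sale becomes an explicit hypothesis (here, "one instance can serve 500 requests per second") that can be observed, tested, and refined over time.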
This is the difference between being a reactive systems administrator of the past, relying too much on manual processes and ad hoc approaches and being the hero of the day, versus applying scientific methods and taking an engineered approach towards managing systems, with much more predictability and much less heroism. And that's what I would like to call DevOps Engineering. The people who implement it in organisations such as Google are known as Site Reliability Engineers.
The signature of Site Reliability Engineers is that they not only apply the DevOps engineering approach, but are also typically responsible for maintaining a high-traffic, public-facing site. When you are in charge of such a site, you need to think about the smallest of optimisations, performance engineering, on-call monitoring, etc., and that's the life of an SRE.
Why calling everyone an SRE is not a great idea
Now, let's come to why calling everyone an SRE, to avoid the chaos caused by different interpretations of DevOps, will not stick, even though that seems to be the trend of the day. The premise for labelling everyone an SRE is that it would bring more clarity to the DevOps world. And this is typically what the organisations with very clear SRE practices, who are also the thought leaders, and the people who listen to their words and repeat them, would love to preach to the world.
However, I sincerely believe the thought leaders sitting on Mt. Everest, looking at the world from a distance of thousands of feet, are clearly unaware of the ground realities. Even though I wish there were just one definition and interpretation of DevOps, the ground reality is far from it. There is a reason behind this chaos, and I believe that tagging everyone as an SRE would not only NOT work, but also would not make sense. What I say here is based on my observations, experiments, and hypotheses, which I have refined over time.
I really wish the world was a simpler place, and everyone had the clarity that organisations such as Google have. (Google is just an example, because they are definitely very clear about SRE practices and promote them very nicely; there are a few others who follow SRE to the core.) They have a very clear definition of who an SRE (team/personnel) is, and what they are supposed to do (implement DevOps). It would work really well if you are running things like Google does. However, not every organisation is the same, neither in structure, nor in the nature of its product, nor in the way it has been architected.
Let's look at some examples to understand the distinction. Most (I say most) of the products delivered by Google or the likes are completely web/API-based applications. In such cases, the role of an SRE is critical, as these are the sites which need to be available, scalable, and secure (reliable), and someone needs to be dedicated to making them reliable. People who do that take a DevOps engineering approach towards it. Inevitably, by being on call, using observability tools, and at times taking reactive measures, they are directly responsible for maintaining SLIs/SLOs and indirectly responsible for SLAs (see the video "SLIs, SLOs, SLAs, oh my! (class SRE implements DevOps)" on YouTube).
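For readers who have not worked with these terms, here is a rough sketch in Python of how an availability SLI and the error budget implied by an SLO relate to each other. The function names, the 99.9% target, and the request counts are my own made-up assumptions for illustration; this is not taken from any specific organisation's tooling.

```python
def availability_sli(good_requests, total_requests):
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo_target):
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the objective for the SLI, e.g. 0.999 for "three nines".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    # The SLO allows (1 - target) of all requests to fail: that is the budget.
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


# Made-up month: 1,000,000 requests, 600 of them failed, SLO of 99.9%.
sli = availability_sli(999_400, 1_000_000)
budget = error_budget_remaining(999_400, 1_000_000, slo_target=0.999)
print(f"SLI: {sli:.4f}, error budget left: {budget:.0%}")
```

In this made-up month the team has spent about 60% of its error budget, so roughly 40% is left. The SLA is the external promise that sits on top of all this, which is why the SRE team is only indirectly responsible for it.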
Now let's look at the other set of organisations. This could be surprising for the SRE promoters, but yes, everyone else exists on this planet too, and they are also trying to improve their processes, be more agile, and at times are just looking to simplify and expedite how they work. Let me give you an example.
Let's say there is an organisation which builds and ships network devices, with a software component managing the network protocols and communications. Their product is not web based, and they don't have to host it on the cloud and keep it running. Sure, they could package observability tools which could help with setting SLIs, but that's not what the people building and supporting this software need. In such cases, SRE becomes irrelevant, but not DevOps. I could still use the principles and practices of DevOps while building such software to improve my processes, reduce defects, and create predictable, automated release cycles. And I would rather call the person who does this a DevOps Engineer and not an SRE.
Let's take another example. I know a lot of enterprises who are still running monolithic, standalone applications, and will continue to do so for at least the next few years. Do they have to care about SRE? I don't think so. Will they still benefit from DevOps practices such as Infrastructure as Code for installing and configuring their applications? Hell yeah!
There are many examples out there, and I have seen many clients confused by the utter chaos caused by so-called thought leaders preaching from their own narrow assumptions about how the world works. However, the good news is that the world will always eventually see sense, bypass such preached philosophies, and choose the ones which make sense. That has been the key character of the DevOps movement. DevOps offers guiding principles and practices, the implementation of which should be completely context based. For some, implementing it by forming SRE teams makes absolute sense. For everyone else, using the generic name of DevOps Engineer and looking for specific technologies makes more sense. The chaos starts when one starts preaching and enforcing one's narrow view as what DevOps is all about, and others start blindly listening to it.
In conclusion, I believe SRE is great for specific organisations, and it should be promoted wherever applicable. I would go on to call SREs the elite squad of DevOps engineers, as they end up working on high-traffic, public-facing sites and have to be on their toes dealing with performance, scalability, and optimisation issues. But the position of DevOps Engineer is here to stay; it is more generic and applicable to a wider set of organisations. If everyone starts calling themselves an SRE, it will not only water down the SRE practices, but also shift the chaos from the DevOps world to SRE.
About: I am the author of the edX course Introduction to Site Reliability Engineering and DevOps and have published many courses on the topics of DevOps and SRE. You can find the list of all my courses here: http://schoolofdevops.com.