Nine years ago, I wrote a blog post to more fully explain a core organizational outcome of DevOps and cloud computing. In that post, I focused on operations and how operations activities would be organized, including via automation. I argued that a separation of concerns was required to make public cloud adoption easier, and that there was a new layer of technology services that needed to be operated. The resulting model looked like this:
It has been fascinating to see the slow evolution of IT practices toward this model. To be sure, it's taken much longer than I anticipated, and the labels are frequently very different, but we are seeing this basic model arise within many enterprises. However, this model is now a bit simplistic, and I think it's time to update it with more detail.
The reason this evolution is taking place is simple: The core role of IT is still to find the software that benefits the business, and to match the software with the hardware required for execution. The difference today, however, is scale. If you define a deployment as a single software package being released one time to the required hardware, then the number of deployments has skyrocketed due to power law increases in both the number of software packages and the number of releases.
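The multiplicative effect is worth making concrete. A rough sketch, using entirely hypothetical figures, shows why deployment counts explode when both dimensions grow at once:

```python
# Illustrative only: hypothetical figures showing how deployment counts
# multiply when both the number of packages and the release cadence grow.

def total_deployments(num_packages: int, releases_per_package: int) -> int:
    """A deployment is one package released once, so totals multiply."""
    return num_packages * releases_per_package

# A hypothetical enterprise a decade ago: 50 packages, quarterly releases.
then = total_deployments(50, 4)    # 200 deployments/year

# The same enterprise today: 500 services, weekly releases.
now = total_deployments(500, 52)   # 26,000 deployments/year

print(f"then={then}, now={now}, growth={now // then}x")
```

Even these modest hypothetical inputs yield a two-orders-of-magnitude jump, which is the scale problem the rest of this model is designed to absorb.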
There’s more to it than just ‘DevOps’
This massive growth in scale has required an evolution in practices and organization to achieve success. Most of what technologists are aware of in this regard is labeled “DevOps,” but there is more nuance to it than that. The way infrastructure capacity is allocated becomes decoupled from specific hardware, so the infrastructure team has to adopt new tools. The way databases and message buses, among other things, are operated and made available to applications has become more “self-service”, and thus those teams have to see themselves as service providers rather than as infrastructure teams.
This is not an easy transition without conscious and concerted efforts around maintaining the right communication mechanisms and separations of concerns. Here’s how I would model the technology teams of a hypothetical IT organization designed to deliver custom software to production quickly and safely:
Let’s break this down a bit further.
Application Development/Operations is the collection of teams that are responsible for building end-user business applications (both customer-facing and employee-facing) and business-specific backend services. These teams are entirely business-objective-focused, and are nowadays tied directly to specific business units. There is no distinction between a “dev” org and an “ops” org, because modern development practices (aka DevOps) have largely eliminated that distinction for these teams.
Platform Services is the collection of service delivery teams that simplify and accelerate software development without adding unacceptable risks to the operation of the business. Platform Services primarily serves the application teams, has a product and/or service delivery mentality, and the service it delivers may be best described as “Path-to-Production-as-a-Service.”
There are three distinct teams that will typically reside within the broader Platform Services team:
Release Management (CI/CD) manages the key developer-facing interface for building and deploying applications as services. They are generally responsible for elements of the path to production that are process-oriented, such as onboarding, build chains, test automation mechanisms, and so on.
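The process-oriented nature of this work can be sketched as a gated pipeline: each stage must pass before the next runs, and a failure stops the release. This is a minimal illustration with invented stage names, not a model of any specific CI/CD product:

```python
# A minimal sketch of a gated path to production: stages run in order,
# and the first failure halts the pipeline. Stage names are hypothetical.

from typing import Callable

Stage = Callable[[str], bool]

def build(artifact: str) -> bool:
    print(f"building {artifact}")
    return True

def run_tests(artifact: str) -> bool:
    print(f"testing {artifact}")
    return True

def deploy(artifact: str) -> bool:
    print(f"deploying {artifact}")
    return True

def run_pipeline(artifact: str, stages: list[Stage]) -> bool:
    """Run each stage in order; stop at the first failure."""
    for stage in stages:
        if not stage(artifact):
            print(f"pipeline stopped at {stage.__name__}")
            return False
    return True

ok = run_pipeline("orders-service:1.4.2", [build, run_tests, deploy])
```

The value the release management team adds is owning this sequence as a product: onboarding a new application means plugging it into an existing, well-tested pipeline rather than building one from scratch.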
Platform is the team that operates the runtime environment for custom applications. A good platform team is able to automatically update and patch the platform itself, as well as key underlying components of the application builds, such as operating systems, runtime libraries, network configurations, etc.
Technology Services is made up of teams that run various key software infrastructure services, such as datastores, analytics services, message buses, queuing, caches, and so on. These tend to be specialist teams. They also tend to have a better staff-to-customer ratio than past methods of delivering these types of infrastructure, because their use of automation, standard patterns, and other platform elements enables each offering to be run like a SaaS service with new economies of scale. These services are typically made available to developers through the platform.
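The "run it like SaaS" idea can be sketched as a self-service catalog: an application team requests a standard offering and gets back a binding, with no bespoke ticket or one-off build. All names here are invented for illustration:

```python
# A sketch of self-service technology services: a hypothetical catalog of
# standard offerings, each provisioned from the same automated pattern.
# Offering names and the binding format are invented.

import uuid

CATALOG = {
    "postgres": {"tier": "standard"},
    "redis-cache": {"tier": "standard"},
    "kafka-topic": {"tier": "standard"},
}

def provision(offering: str, app: str) -> dict:
    """Provision a catalog offering and return a binding for the app."""
    if offering not in CATALOG:
        raise KeyError(f"{offering} is not a standard offering")
    instance_id = uuid.uuid4().hex[:8]
    return {
        "offering": offering,
        "app": app,
        "binding": f"{offering}-{instance_id}.svc.internal",
    }

b = provision("postgres", "orders-service")
```

The economics follow from the constraint: because everything is provisioned from the same pattern, one specialist team can operate many instances, which is where the improved staff-to-customer ratio comes from.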
Infrastructure Services is the collection of teams that deliver computing, network, and storage capacity to applications securely, with reliability and predictability as key objectives. These teams serve the platform services team in this model, although there may be some edge cases where they directly serve an application development/operations team.
There are two (or more) distinct teams that will typically reside within the broader Infrastructure Services team:
Data Center Automation
Data Center Automation delivers software abstractions that enable physical capacity to be decoupled from software deployment, such as server virtualization, container management, software-defined networking, and so on. This is a critical capability, because it generally decouples the time it takes to request capacity from the time it takes to acquire hardware—which in the past has been a major constraint to the timely delivery of software.
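The decoupling described above can be sketched simply: capacity requests are satisfied in software from an already-acquired pool, so the request path never waits on procurement. Figures and names are hypothetical:

```python
# A sketch of decoupling capacity requests from hardware acquisition:
# requests draw from a pre-acquired pool and return immediately, while
# procurement happens out of band. All figures are invented.

class CapacityPool:
    """Pre-acquired hardware capacity, allocated on demand via software."""

    def __init__(self, total_vcpus: int):
        self.total_vcpus = total_vcpus
        self.allocated = 0

    def request(self, vcpus: int) -> bool:
        """Returns immediately; no procurement in the request path."""
        if self.allocated + vcpus > self.total_vcpus:
            return False  # pool exhausted: triggers procurement out of band
        self.allocated += vcpus
        return True

pool = CapacityPool(total_vcpus=1024)
granted = pool.request(16)  # the application team waits seconds, not weeks
```

The hardware purchase cycle still exists, of course; it just moves off the critical path of software delivery, which is exactly the constraint this layer removes.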
Server, Network, Storage, and Facilities
Server, Network, Storage, and Facilities are the traditional organizations that manage the physical elements of information technology, including computers, switches and routers, disks and tapes, and data center facilities, respectively.
SRE and InfoSec span everything
There are a couple of other teams that are critical to the success of a modern IT development organization. These are teams that work closely with all of the teams above to focus on key specific outcomes that are systemic, and therefore dependent on coordination across applications, platform services, and infrastructure. They are Systems (or Site) Reliability Engineering (SRE) and Information Security (InfoSec):
Briefly, these teams deliver the following services:
Systems (or Site) Reliability Engineering
Systems (or Site) Reliability Engineering (SRE) is "what happens when a software engineer is tasked with what used to be called operations." In a way, it’s an organization that looks at operations automation as both a product and a field of study. A typical SRE organization works with application developers, platform team members, and infrastructure operators to tune the overall IT systems environment to optimize availability, reliability, performance, and maintainability. They may coach developers on application architecture or coding elements, while simultaneously working with the infrastructure team on network optimization.
Information Security (InfoSec)
Information Security (InfoSec) is ultimately responsible for reducing the risk related to an organization’s information portfolio, a practice that typically requires a holistic, systemic view of the IT systems environment. Here, too, software development practices have become a dominant skill for InfoSec practitioners, but elements of data center, network, and software configuration, monitoring, “game play,” and incident response are key practices in this space. The InfoSec team will also work with all of the other teams mentioned in this post to protect company data.
It is critical to not ignore or misalign these organizations, as the guardrails and practices they deliver are critical elements in a risk-mitigation strategy. Guardrails should be automated where possible and incorporated in the path to production. Both InfoSec and SRE should be present during the design, code review, testing, and delivery phases of the software cycle. They should also be leaders in incident review and support escalation discussions.
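Automating guardrails into the path to production can be sketched as a set of policy checks run against every deployment before it ships, replacing a manual review gate. The specific checks below are hypothetical examples, not a prescribed policy set:

```python
# A sketch of automated guardrails in the path to production: each check
# encodes an InfoSec or SRE policy, and violations block the release.
# The checks and the deployment fields are hypothetical.

def no_privileged_containers(deploy: dict) -> bool:
    return not deploy.get("privileged", False)

def has_resource_limits(deploy: dict) -> bool:
    return "cpu_limit" in deploy and "memory_limit" in deploy

GUARDRAILS = [no_privileged_containers, has_resource_limits]

def violations(deploy: dict) -> list:
    """Return the name of every guardrail this deployment fails."""
    return [check.__name__ for check in GUARDRAILS if not check(deploy)]

good = {"cpu_limit": "500m", "memory_limit": "256Mi"}
bad = {"privileged": True}
```

Because the checks run on every release rather than in an occasional audit, the guardrail teams scale with the deployment volume instead of becoming its bottleneck.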
The goal is fluidity, not hierarchy
When you evolve toward the structure described above, you get a powerful construct that enables modern software development best practices, drives a “customer-focused” path-to-production organization, and decouples hardware acquisition from capacity allocation. The organization gains the ability to deliver software with fewer process and resource barriers, which speeds up software development cycle times and enables the business to evolve faster. And all of this is delivered without losing focus on information security or application availability.
It’s important to note that within each of these boxes, what has consistently shown as being the most successful organizational model is one of cellular, autonomous teams with direct communication to the services they need to consume to complete their mission. This model is in no way meant to convey a rigid, hierarchical organization, because data shows that such organizations fail to keep pace with more fluid, “two-pizza team” models.
This isn’t just speculation. For more on the data that supports this, I highly recommend reading Accelerate by Nicole Forsgren (PhD), Jez Humble, and Gene Kim.
By James Urquhart