    SREcon24 Americas - Real Talk: What We Think We Know — That Just Ain’t So

    This discussion critically examines common misconceptions and conventional wisdom within the Site Reliability Engineering (SRE) community, urging a more scientific and skeptical approach to established practices. The speaker emphasizes the importance of questioning deeply held beliefs, much like debunking general misconceptions such as the Great Wall of China being visible from space or a dropped penny from the Empire State Building being lethal. The core argument is that for a field to be considered scientific, it must acknowledge past errors and maintain productive skepticism about its assumptions. The presentation delves into specific areas where conventional wisdom might be flawed, including the measurement of engineer productivity, the role of change in incidents, the linearity of incident response models, and the concept of repeat incidents. It highlights that an illusion of knowledge can be a greater barrier to discovery than ignorance, advocating for a continuous process of validation and inquiry over blindly accepting norms. The ultimate goal is to foster a community that openly questions and refines its understanding of complex systems, ensuring that practices are based on empirical evidence rather than unchallenged assumptions.

    Questioning Conventional Wisdom

    The speaker begins by underlining the importance of critical thinking within the SRE community, drawing a parallel to general misconceptions that, while harmless, illustrate how easily unsubstantiated beliefs can spread. The examples of the Great Wall of China not being visible from space, or a penny dropped from the Empire State Building not being lethal, highlight how widely such myths can take hold. In the context of SRE, the argument is that some deeply ingrained beliefs might similarly be "what you know for sure that just ain't so," a direct reference to a quote attributed to Mark Twain.

    It ain't what you don't know that gets you in trouble. It's what you know for sure that just ain't so.

    This sets the stage for a discussion on challenging assumptions that could have significant real-world consequences in system reliability and operations. The speaker quotes his colleague David Woods, emphasizing that a scientific field must be willing to admit past errors and consistently apply skeptical inquiry to its foundational concepts. This continuous questioning and validation are crucial for the SRE community to evolve and mature.

    Engineer Productivity Metrics

    One primary misconception discussed is the historical approach to measuring engineer productivity. Prior to the early 1990s, the dominant methodology involved simply counting lines of code written by engineers. This metric, while seemingly straightforward, proved to be highly misleading and ineffective.

    The use of lines of code metrics for productivity and quality studies is to be regarded as professional malpractice starting in 1995.

    This quote, attributed to Capers Jones, marked a significant shift in thinking. The speaker highlights that this method, which carried over from the days of Fortran and COBOL on punch cards where lines of code had a physical manifestation, did not accurately reflect productivity or quality in modern software development. This serves as a historical example of the community confronting and debunking a widely accepted but flawed idea.
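
    To make the distortion concrete, here is a minimal, hypothetical illustration (not from the talk): two functions with identical behavior but very different line counts. A lines-of-code productivity metric would score the verbose version higher, which is exactly the kind of misreading the Capers Jones quote warns against.

    ```python
    # Hypothetical illustration: identical behavior, very different line counts.
    # A lines-of-code metric would rate the verbose version as "more productive."

    def total_order_value_verbose(prices, quantities):
        total = 0
        for i in range(len(prices)):
            price = prices[i]
            quantity = quantities[i]
            line_value = price * quantity
            total = total + line_value
        return total

    def total_order_value_concise(prices, quantities):
        return sum(p * q for p, q in zip(prices, quantities))

    # Both compute the same result, e.g. 2.0*3 + 5.0*1 == 11.0.
    assert total_order_value_verbose([2.0, 5.0], [3, 1]) == 11.0
    assert total_order_value_concise([2.0, 5.0], [3, 1]) == 11.0
    ```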

    Changes as the Sole Cause of Incidents

    Another common belief challenged is the notion that changes are either the only or the leading cause of incidents. This perspective often leads to a simplistic view where "change" is inherently bad. While Gartner in 2015 reportedly stated that 85% of performance incidents could be traced to changes, the speaker introduces a nuanced counter-argument.

    He proposes that changes are also a leading cause of resolving incidents and that all prevented incidents are likely triggered by making changes. This highlights the dual nature of change: it can introduce issues, but it is also essential for progress, improvement, and problem resolution. The speaker advocates for a productive skepticism towards fuzzy or implicit assumptions about change, urging the community to validate whether these assumptions hold true under scrutiny.

    Linear Incident Response Models

    The presentation critiques the common depiction of incident response as a neat, linear sequence of steps or phases. Diagrams online often present incident management as an orderly process with distinct stages like "diagnosis" and "mitigation." However, real-world incident experiences rarely conform to such clean models.

    The speaker recounts a deliberately simplified version of a real incident to underscore this point: someone notices a problem, it escalates, people attempt fixes, and the team then assesses whether the problem is resolved. He poses the critical question, "Where is diagnose?", challenging the notion of clearly defined, sequential phases. The reality is that cognitive work during an incident is dynamic, intertwined, and continuous, not easily captured by discrete steps. Despite this, such linear models often drive consequential business decisions, resource allocation, and reporting, creating a disconnect between the model and the actual messy nature of incident handling. The speaker references a model developed by Dr. Woods and others, based on cases from nuclear power control rooms, which depicts anomaly response as a non-linear, intertwined, and interdependent process that better reflects the cognitive complexities involved.
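
    A small, hypothetical event log (invented here, not from the presentation) illustrates the point: in a plausible incident timeline, detection, diagnosis, and mitigation interleave rather than forming the clean, contiguous phases that linear diagrams suggest.

    ```python
    # Hypothetical timeline of one incident's activities (times are minutes
    # since the first alert). The labels reuse the phases from linear models.
    timeline = [
        ("00:00", "detect",   "alert fires on elevated checkout latency"),
        ("00:03", "mitigate", "roll back the most recent deploy, just in case"),
        ("00:05", "diagnose", "latency unchanged; the deploy was not the trigger"),
        ("00:09", "diagnose", "notice connection pool exhaustion on one shard"),
        ("00:11", "mitigate", "raise the pool limit on the affected shard"),
        ("00:14", "detect",   "second alert: error rate climbing on search"),
        ("00:20", "diagnose", "search errors traced to the same shard"),
        ("00:25", "resolve",  "latency and error rates back to baseline"),
    ]

    def is_linear(events):
        """True only if each activity forms one contiguous block, in phase order."""
        order = ["detect", "diagnose", "mitigate", "resolve"]
        phases = [order.index(kind) for _, kind, _ in events]
        return phases == sorted(phases)

    print(is_linear(timeline))  # False: diagnosis and mitigation interleave
    ```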

    The Concept of Repeat Incidents

    The discussion also addresses the concept of "repeat incidents," questioning how organizations define and treat them. While some organizations explicitly count and tabulate repeat incidents for reports, the speaker argues that the criteria for labeling an incident as a repeat matter more than the mere existence of a "repeat."

    He points out that what constitutes a "repeat" can vary significantly among individuals and organizations. For example, is it a repeat if it occurs at the exact same Unix epoch, if the same people respond, or if it simply feels similar? This ambiguity reveals different perspectives, which can be valuable, but it also highlights the potential for arbitrary categorization, especially when consequential decisions are based on these counts. The speaker cites examples from the VOID (a curated collection of incident write-ups) in which multiple occurrences are detailed within a single write-up, raising the question of whether they constitute one incident or several "repeats," which could skew metrics such as incident averages. The richness of incident experience, he emphasizes, lies not in metrics but in the surprising, confusing, or frustrating qualities that practitioners remember.
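
    As a minimal sketch of why the criteria matter, the hypothetical snippet below (invented here, not taken from the talk) counts "repeats" in the same three-incident history under three different matching rules and gets different totals depending on the rule.

    ```python
    # Hypothetical incident history and matching rules; the point is only that
    # the "repeat" count depends entirely on the chosen criterion.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Incident:
        service: str
        symptom: str
        responders: frozenset

    history = [
        Incident("checkout", "latency spike", frozenset({"ana"})),
        Incident("checkout", "error burst",   frozenset({"ana"})),
        Incident("checkout", "latency spike", frozenset({"lee"})),
    ]

    def count_repeats(incidents, same_as):
        """Count incidents that match at least one earlier incident under a rule."""
        return sum(
            any(same_as(incidents[i], earlier) for earlier in incidents[:i])
            for i in range(1, len(incidents))
        )

    criteria = {
        "same service":             lambda a, b: a.service == b.service,
        "same service and symptom": lambda a, b: (a.service, a.symptom) == (b.service, b.symptom),
        "same responders":          lambda a, b: a.responders == b.responders,
    }

    for name, rule in criteria.items():
        print(f"{name}: {count_repeats(history, rule)} repeat(s)")
    # same service: 2, same service and symptom: 1, same responders: 1
    ```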

    Incident Response vs. Stakeholder Communication

    Finally, the speaker distinguishes between effective incident response and efficient stakeholder communication, asserting that an organization can excel at one while being terrible at the other. Keeping stakeholders updated is an important aspect of incident management, but it is distinct from the hands-on work of anomaly response.

    He argues that the most aspirational and ideal scenario for incident response involves practitioners who can immediately recognize what is happening and know exactly what to do about it. Anything that supports, expands, or augments the expertise of these hands-on practitioners is a worthwhile and productive investment. This includes practices like code reviews and the sharing of knowledge between tenured veterans and new hires. The key idea is that incident response is multifaceted, not monolithic. While all aspects are important, optimizing for fluid, effective handling by expert practitioners should be prioritized, as this often leads to incidents being resolved so seamlessly that they might not even be formally labeled as "incidents" due to their low disruption.

    Takeaways

    1. Productive Skepticism: The SRE community must embrace productive skepticism and question conventional wisdom and implicit beliefs to evolve as a scientific field.
    2. Critique of Metrics: Traditional metrics like "lines of code" for engineer productivity, or simple counting of "repeat incidents," are often misleading and should be critically re-evaluated.
    3. Nuance of "Change": While changes can cause incidents, they are also crucial for resolving issues and preventing future problems; avoid simplistic views of change as inherently "bad."
    4. Non-Linear Incident Models: Real-world incident response is dynamic, intertwined, and continuous, not fitting neatly into linear, sequential models often used for reporting and decision-making.
    5. Expertise Over Metrics: Prioritize investments in expanding and diversifying the expertise of hands-on practitioners for effective incident response, as this is more critical than precise real-time customer impact metrics.
