Valuable insights
1. SRE Principles Simplify Housing Management: Applying established Site Reliability Engineering practices to the management of multi-apartment buildings can significantly simplify complex operational challenges faced by homeowners associations (TSZh).
2. Organizational Structures Mirror IT Functions: The organizational hierarchy of housing management (management companies, inspectorates, and resource providers) finds direct parallels in IT roles like operations services, quality assurance, and compliance monitoring.
3. Cross-Functional Teams Essential for Repairs: Siloed work among specialized contractors leads to project failure; effective complex repairs require truly cross-functional engineering teams capable of holistic problem resolution.
4. Accountability Demands Delegated Leadership: Decision-making authority must be delegated to a specific, compensated leader who bears direct responsibility, rather than relying on fluid roles within a group structure.
5. Formalize Strategic Planning Meetings: Capital repair planning sessions must adopt IT strategic meeting rigor, including mandatory agendas, defined success criteria, and documented outcomes to ensure efficiency.
6. Adopt the 'Agree and Commit' Principle: Once a strategic decision is reached through voting or consensus, all team members must adhere to the final plan, even if they initially held a minority dissenting view.
7. Address Root Causes, Not Just Symptoms: Apparent sabotage, such as leaving doors open, can be resolved by addressing the underlying user friction, like installing necessary amenities such as benches.
8. Optimize Communication Channels for Efficiency: Large, open community chats are prone to negativity and inefficiency; communication should be focused through announcement channels and structured ticketing systems.
9. Measure User Comfort as Key SLOs: System observability must track user comfort metrics, requiring clear definition of service level objectives (SLOs) and separation between technical and business indicators.
10. Mandatory Incident Response Documentation: Effective incident handling requires 24/7 on-call rotations, comprehensive runbooks detailing recovery steps, and mandatory post-mortems for organizational learning.
Introduction and Structural Parallels in Operations
Gleb Goncharov presented an analysis comparing the operation of complex computer systems with the management of multi-apartment buildings, drawing parallels from the perspective of a Site Reliability Engineer (SRE) and a chairman of a homeowners association (TSZh). The speaker detailed personal experience managing monolithic systems at SberMarket alongside serving as chairman for a 48-unit building for two years. This dual perspective revealed fundamental similarities in the underlying management principles governing both physical infrastructure and IT systems.
Mapping Housing and IT Organizational Structures
The organizational structure of housing management closely mirrors IT departments. Resource companies act as external providers, while management companies (UK) function as operations services. The Housing Inspectorate serves as a quality assurance body, monitoring service execution. Furthermore, financial reporting to tax and social insurance authorities parallels compliance functions. The building steward's role, representing owner interests to management and resource organizations, directly correlates with the function of an SRE team member or leader within a project.
Operational Commonalities Across Domains
Both housing and IT systems share core operational necessities. They require regular maintenance, configuration adjustments, and planned introduction or retirement of equipment. Crucially, both demand continuous observation to ensure uninterrupted service; for instance, heating systems must operate around the clock during winter. Security—whether physical protection against intrusion or digital defense against data leaks—falls under the operational service's purview. Finally, both domains necessitate preparedness for unexpected failures and rapid restoration of serviceability.
Acknowledging Domain-Specific Differences
Obvious distinctions exist primarily in the subject matter. Building operations inherently involve physical space, construction, plumbing, and electrical knowledge. Conversely, IT operations require expertise in development, design, and testing methodologies. Software updates are generally more frequent in IT than physical modernization in housing, which can be costly. However, considering the sheer scale of housing infrastructure nationally, these differences are considered less fundamental than the shared principles of engineering and management underpinning both spheres.
Applying SRE Principles to Housing Operations
The discussion transitioned into practical case studies illustrating the application of SRE concepts. The first example involved selecting a new management company, mirroring the process of vetting an operations provider. Initial selection criteria focused heavily on ensuring the company possessed a diverse set of in-house engineers—plumbers, electricians, carpenters—with defined qualification tiers and ownership of necessary tools, such as a snow-clearing grader.
The Imperative for Cross-Functional Teams
A specific repair project for a building entrance failed because specialized contractors operated in isolation, each responsible only for their narrow zone. This was a management error: the work called for a truly cross-functional team. Such a team, analogous to an SRE unit, should incorporate diverse skill sets, utilize grading matrices to define competency levels, and maintain small, effective sizes to ensure efficient communication.
- Possessing a grade matrix where competency levels are clearly defined, similar to IT roles.
- Maintaining a technology radar and creating roadmaps for future planning.
- Operating with a small size, typically not exceeding ten members for effective communication.
- Sharing common goals and values, often facilitated by frameworks like the Team Canvas.
Leadership, Accountability, and Delegation
A second story highlighted leadership through the issue of pigeons nesting and fouling an open balcony. While previous attempts to solve this over 12 years failed, the new TSZh chairman quickly resolved it by consulting the association, allocating budget, and installing polymer netting in one day. This swift resolution demonstrated that responsibility must be tied to a designated, compensated leader who is empowered to act.
- Every process must have a single, responsible executor.
- Decision-making authority should be delegated to the interested leader who receives compensation for results.
- Direct responsibility must be maintained, ensuring that 'blameless' culture does not equate to irresponsibility.
Strategic Planning and Weighted Input
The annual voting for capital repairs serves as the housing equivalent of an IT infrastructure development plan. These meetings are resource-intensive and necessitate a formalized agenda and clear success metrics. The strategy itself benefits from being drafted by a designated leader, informed by input from a council of experts. Furthermore, adherence to the 'Agree and Commit' principle post-decision is vital for execution progress, mirroring IT expectations.
- Maintaining detailed minutes or transcripts of meetings for future reference.
- Assigning greater weight to senior engineers' votes on strategic matters, analogous to property share weighting in housing votes.
- Ensuring all participants commit to the final decision, regardless of initial disagreement.
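The weighted voting described above can be sketched in a few lines. This is a minimal illustration, not a description of any real voting procedure: the `Ballot` type, the weights, and the simple-majority `threshold` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ballot:
    voter: str
    weight: float  # hypothetical: property share in a housing vote, or seniority weight in a team
    in_favor: bool


def weighted_approval(ballots: list[Ballot]) -> float:
    """Return the fraction of total weight cast in favor (0.0 if no weight was cast)."""
    total = sum(b.weight for b in ballots)
    if total <= 0:
        return 0.0
    return sum(b.weight for b in ballots if b.in_favor) / total


def decision_passes(ballots: list[Ballot], threshold: float = 0.5) -> bool:
    """A proposal passes when the in-favor weight strictly exceeds the threshold.

    The 0.5 default is an assumption for illustration only.
    """
    return weighted_approval(ballots) > threshold


ballots = [
    Ballot("owner_a", weight=60.0, in_favor=True),
    Ballot("owner_b", weight=25.0, in_favor=False),
    Ballot("owner_c", weight=15.0, in_favor=False),
]
# One large share outweighs two smaller ones, even though it is one vote against two.
print(decision_passes(ballots))
```

Once `decision_passes` returns a result, 'Agree and Commit' applies: the outcome binds the voters who were in the minority as well.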
Communication, Observability, and Change Management
Even a well-thought-out strategy can fail due to local resistance or sabotage. One incident involved a resident intentionally keeping the main entrance door unlocked during summer, motivated by the desire to have a bench installed for elderly residents' convenience. The issue was resolved not by enforcing rules, but by installing the desired amenity, proving the importance of addressing user friction points directly.
Effective Communication and Transparency
Resolving disagreements and fostering trust requires properly organized synchronous and asynchronous communication channels. A unified Kanban board or ticketing system allows open tracking of progress and feedback. Transparency is further boosted by publicly posting announcements and operational materials on an official website, which builds user confidence in the provided information.
- Avoid creating large, general community chats, as they often devolve into negativity and rarely produce actionable decisions.
- Ensure meetings include only essential participants, recognizing that universal communication is inherently inefficient.
- Utilize a dedicated announcement channel as the best method for broad, one-way communication.
Monitoring User Comfort and System Health
A resident complained of persistent coldness, leading to threats of legal action. Investigation revealed that the central heating system was not at fault: the cause was poorly sealed windows specific to that unit, compounded by the resident's misunderstanding of service norms. This highlights the need for observability that measures user comfort alongside technical performance, requiring defined SLOs and service tiering based on impact.
- Employing suitable telemetry tools like Prometheus, VictoriaMetrics, or the ELK stack for data collection.
- Differentiating indicators into technical, user-facing, and business metrics.
- Conducting regular reviews of SLO performance, often categorized by service tiers with differing quality requirements.
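A comfort-oriented SLO of this kind could be evaluated roughly as follows. The sketch assumes indoor temperature readings as the user-facing indicator; the 20.0 degree minimum, the tier names, and the tier targets are hypothetical values chosen for illustration.

```python
# Hypothetical service tiers with differing quality requirements.
TIER_TARGETS = {"tier1": 0.99, "tier2": 0.90}


def slo_compliance(samples: list[float], minimum: float) -> float:
    """Fraction of samples at or above the minimum acceptable value."""
    if not samples:
        raise ValueError("no samples to evaluate")
    good = sum(1 for s in samples if s >= minimum)
    return good / len(samples)


def meets_slo(compliance: float, tier: str) -> bool:
    """Compare measured compliance with the target assumed for the given tier."""
    return compliance >= TIER_TARGETS[tier]


# Illustrative indoor temperature readings (degrees Celsius) for one apartment.
readings = [21.0, 20.5, 19.0, 22.0, 18.5, 21.5, 20.0, 20.2, 21.1, 20.8]
compliance = slo_compliance(readings, minimum=20.0)
print(compliance, meets_slo(compliance, "tier2"))
```

Measuring comfort inside the apartment, rather than only the boiler output, is what would distinguish a heating failure from a local problem such as poorly sealed windows.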
Managing Planned Work via Change Control
Planned maintenance, like the scheduled snow removal requiring temporary parking restrictions, can fail due to inadequate preparation. The failure to coordinate resident vehicle movement demonstrated poor change management execution. Effective planned work in IT requires strict adherence to change control processes to prevent overlaps and ensure smooth deployment and rollback procedures.
- Maintaining a shared calendar of all planned work to prevent conflicts and increase team transparency.
- Analyzing time slots based on risk profiles to categorize maintenance activities.
- Developing comprehensive rollback plans, which forces clarity on potential system impacts.
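A shared calendar makes conflicts detectable before the work starts. The sketch below shows one possible overlap check; the `PlannedWork` type, the `resource` field, and the sample entries are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class PlannedWork:
    title: str
    start: datetime
    end: datetime
    resource: str  # hypothetical: what the work occupies, e.g. "courtyard" or "db-primary"


def overlaps(a: PlannedWork, b: PlannedWork) -> bool:
    """Two entries conflict when they use the same resource and their time windows intersect."""
    return a.resource == b.resource and a.start < b.end and b.start < a.end


def find_conflicts(calendar: list[PlannedWork]) -> list[tuple[str, str]]:
    """Return the titles of every conflicting pair in the calendar."""
    return [(a.title, b.title) for a, b in combinations(calendar, 2) if overlaps(a, b)]


calendar = [
    PlannedWork("Snow removal", datetime(2024, 1, 15, 8), datetime(2024, 1, 15, 12), "courtyard"),
    PlannedWork("Boiler check", datetime(2024, 1, 15, 9), datetime(2024, 1, 15, 10), "basement"),
    PlannedWork("Furniture delivery", datetime(2024, 1, 15, 11), datetime(2024, 1, 15, 13), "courtyard"),
]
print(find_conflicts(calendar))
```

Here the snow removal and the delivery both claim the courtyard between 11:00 and 12:00, so the pair is reported, while the boiler check runs in parallel without conflict.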
Incident Response and Continuous Learning
A major water pipe burst exposed severe deficiencies in the incident response chain: slow detection, inability to contact emergency dispatch, delayed access to the basement utility shutoff, and prolonged repair time due to missing spare parts. This cascading failure underscores the necessity for documented, rehearsed procedures, similar to how IT systems require defined Disaster Recovery plans.
Building a Resilient Incident Response Framework
- Integrating security controls such as IDS/IPS with logging and tracing systems.
- Establishing 24/7 on-call duty rotations to ensure immediate response and provide feedback to developers.
- Maintaining readily available contact lists and documented recovery instructions, known as runbooks.
- Publishing runbooks through static site generators to enable automated validation and review by support staff.
- Tracking key performance indicators like Mean Time To Acknowledge (MTTA) and Mean Time To Restore (MTTR).
- Writing detailed post-mortems for every incident to capture lessons learned and prevent recurrence.
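The two indicators named above can be computed from incident timestamps. This is a minimal sketch: the `Incident` fields are hypothetical, and it assumes both metrics are measured from the moment of detection, although definitions vary between teams.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    detected: datetime
    acknowledged: datetime
    restored: datetime


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not deltas:
        raise ValueError("no incidents to average")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean Time To Acknowledge: detection until someone takes ownership."""
    return mean_delta([i.acknowledged - i.detected for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Restore: detection until service is back (an assumed definition)."""
    return mean_delta([i.restored - i.detected for i in incidents])


incidents = [
    # A night-time incident: slow acknowledgement, long repair.
    Incident(datetime(2024, 2, 1, 2, 0), datetime(2024, 2, 1, 2, 40), datetime(2024, 2, 1, 8, 0)),
    # A daytime incident handled quickly.
    Incident(datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 10), datetime(2024, 2, 9, 15, 0)),
]
print(mtta(incidents), mttr(incidents))
```

Tracking the two values separately shows whether time is lost in reaching the on-call person or in the repair itself, which were distinct failures in the pipe burst story.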
Finally, organizations should proactively test failure modes through Chaos Engineering. This methodology involves formulating stability hypotheses and intentionally injecting failures to verify resilience, rather than merely executing steps for system shutdown. This proactive approach ensures that systems are robust against unexpected conditions.
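A toy experiment in this spirit might look as follows. It is only a sketch of the idea, not a real chaos tool: `fetch_with_fallback`, `inject_failure`, and the failure rates are hypothetical, and the hypothesis under test is that every request is still answered while the primary dependency fails.

```python
import random
from collections import Counter
from typing import Callable


def fetch_with_fallback(primary: Callable[[], str], fallback: Callable[[], str]) -> str:
    """The system under test: use the primary source, fall back if it is unreachable."""
    try:
        return primary()
    except ConnectionError:
        return fallback()


def inject_failure(func: Callable[[], str], failure_rate: float, rng: random.Random) -> Callable[[], str]:
    """Wrap a dependency so that it fails with the given probability."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func()
    return wrapped


def run_experiment(trials: int, failure_rate: float, seed: int = 0) -> Counter:
    """Count which source answered each request while failures are being injected."""
    rng = random.Random(seed)
    flaky_primary = inject_failure(lambda: "primary", failure_rate, rng)
    return Counter(fetch_with_fallback(flaky_primary, lambda: "fallback") for _ in range(trials))


results = run_experiment(trials=100, failure_rate=0.5)
# The hypothesis holds if all 100 requests were answered by one source or the other.
print(sum(results.values()) == 100, dict(results))
```

If an injected failure escaped as an unhandled exception, the experiment would crash instead of completing, which is the signal that the stability hypothesis was wrong.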
Q&A Insights on Implementation and Roles
During the Q&A, it was noted that in housing management, the chairman often acts as a 'one-man band,' handling accounting, engineering, and legal duties. IT teams should avoid this single point of failure by integrating SRE practices into cross-functional team functions. Furthermore, while full synchronization of mental models across teams is impossible and potentially detrimental, continuous recalibration is necessary to maintain diverse perspectives on system architecture.