Head of Site Reliability Engineering

THRYVE

Location: Stuttgart | Full-Time | Hybrid

About the Role

We are seeking an experienced and visionary leader to head our Site Reliability Engineering (SRE) function. You will build and scale the SRE organization, align operational excellence with business strategy, and ensure the resilience, security, and performance of our platforms.

Key Responsibilities

Team Leadership & Development: Build, lead, and mentor a high-performing SRE team. Drive a culture of collaboration, accountability, and continuous improvement. Set measurable goals for individual and team growth, and lead regular performance and feedback sessions.
Strategic Planning & Roadmap: Define and maintain an SRE roadmap aligned with overall technology and business objectives. Lead the planning and execution of strategic initiatives focused on scalability, reliability, and cost-efficiency.
Operations Consolidation: Lead the unification and operational integration of systems across multiple product lines, including those from recent acquisitions, ensuring consistent reliability and performance standards.
Cloud Infrastructure Strategy: Direct the evolution of cloud strategy, with a focus on phased migrations to cloud platforms (primarily Azure). Ensure secure, scalable, and cost-effective cloud operations.
Business Continuity & Disaster Recovery: Own and enhance BCDR strategies to protect against service disruptions and data loss. Ensure resilience planning is embedded in all infrastructure and platform decisions.
Monitoring & Observability: Optimize and scale observability tools to enable proactive incident response. Own the full monitoring lifecycle: tooling, coverage, alerting, and reporting.
Deployment Process Optimization: Drive efficiency and reliability in CI/CD pipelines, promoting automation and best practices across environments and teams.
SRE Metrics & Reliability Engineering: Implement and maintain metrics such as SLIs, SLOs, and error budgets to monitor and improve service reliability across systems.
Incident & On-Call Management: Establish and continuously refine the incident management process, including on-call rotations, response protocols, and post-incident reviews.
Risk & Security Oversight: Identify and mitigate infrastructure and operational risks, ensuring systems are compliant and resilient to security threats.
Service Operations & Support: Oversee internal service desk functions, driving SLA performance and improving operational responsiveness.
Team Process & Planning: Lead planning cycles, retrospectives, and agile process improvements to ensure transparency, alignment, and adaptability.
Reporting & Visibility: Provide executive-level reporting and dashboards that highlight key metrics, trends, and areas for improvement in reliability and operations.
Stakeholder Engagement: Act as a primary liaison with internal teams, external partners, and vendors. Manage expectations, ensure alignment, and deliver high levels of service.
Documentation & Knowledge Sharing: Ensure processes, systems, and procedures are well documented to support scale, resilience, and team autonomy.

Qualifications

Proven experience in a senior SRE or engineering operations leadership role.
Deep knowledge of SRE methodologies, cloud platforms (especially Azure), and modern DevOps practices.
Strong leadership track record with experience mentoring teams and driving organizational change.
Hands-on experience with infrastructure automation (e.g., Terraform, Ansible), CI/CD tooling, and containers and container orchestration (e.g., Docker, Kubernetes).
Expertise in incident response, BCDR planning, and SRE performance measurement frameworks.
Ability to align technical operations with business objectives and communicate effectively across all levels of the organization.
Track record of managing complex cloud migrations and system integrations.
