Incident Response Manager
Who we are
About Stripe
Stripe is a financial infrastructure platform for businesses. Millions of companies—from the world’s largest enterprises to the most ambitious startups—use Stripe to accept payments, grow their revenue, and accelerate new business opportunities. Our mission is to increase the GDP of the internet, and we have a staggering amount of work ahead. That means you have an unprecedented opportunity to put the global economy within everyone’s reach while doing the most important work of your career.
About the team
The Incident Ops team is a global 24/7 team responsible for driving incident response and management from detection to resolution. Stripe is proud of its five 9s reliability and this team is at the forefront of ensuring we keep it that way - working hand-in-hand with Reliability Eng and across the Tech Org. This team of incident response managers (IRMs) is defined by our sense of ownership and how we drive incidents to resolution - marshaling the necessary cross-functional resources to respond to and resolve service outages, critical bugs, security attacks and anything that significantly impacts the users of our products. The team is user-first and ensures appropriate external communications from Stripe and senior management to keep our users informed of disruption to their experience of Stripe. The team is skilled in communications and incident handling and is technically adept, as incidents can arise from anywhere and cut across products and orgs in Stripe.
What you’ll do
As an Incident Response Manager (IRM), you’ll play a crucial role in driving the right level of response from Stripes to incidents: determining impact, rallying Stripes to mitigate, communicating to users, ensuring appropriate remediations, and orchestrating the Root Cause Analysis (RCA) process. You’ll work closely with IRMs and responding teams globally to ensure solid 24/7 coverage of how we monitor, detect, respond to, communicate and mitigate incidents. When not managing incidents, you’ll contribute to improving our operations. You’ll focus on developing your skills in incident management, communication, and technical understanding of Stripe’s products and services.
Responsibilities
Act as an Incident Commander for incidents across various classes (reliability, technical, data privacy, product, or security), driving incident resolution with urgency and cross-functional collaboration
Lead all user-facing incidents across domains at Stripe - including reliability, technical, security, and data privacy
Take a 'User First' approach to determining impact, providing accurate situation reports, facilitating comms bridges, and ensuring useful and timely external communications to users
Update internal stakeholders and support decision-making processes during incidents
Participate in the root cause analysis process, conduct post-mortems for routine incidents, and identify remediations
Collaborate with engineering, product, and operations teams to improve incident handling processes and tooling
Contribute to team culture and processes that enhance incident response capabilities
Who you are
We’re looking for someone who meets the minimum requirements to be considered for the role. If you meet these requirements, you are encouraged to apply. The preferred qualifications are a bonus, not a requirement.
Minimum requirements
3+ years of demonstrable major incident experience for organizations that run mission-critical applications or always-on SaaS environments.
Demonstrated ability to independently lead multiple incidents concurrently with minimal support and guidance from senior team members
Basic understanding of application development, architectures, and cloud environments
Familiarity with infrastructure concepts, including physical, virtual, and container-based compute platforms
Practical experience using modern monitoring and telemetry tools such as Splunk, Prometheus, and Grafana
Basic data analysis skills using SQL, Splunk or other tools.
Strong task management skills, with attention to detail and ability to remain composed in high-pressure situations.
Good written and verbal English communication skills, with the ability to translate complex technical issues for various stakeholders.
Preferred qualifications
Familiarity with different types of incidents such as technical, privacy, security, or crisis, with an eagerness to continually learn about Stripe's products and systems.
Experience in conveying key details of technical issues to stakeholders
Experience with broad public-facing communications (e.g. status pages, tweets) and/or targeted communications (e.g. direct emails, support ticket responses).
Familiarity with distributed architectures and system inter-dependencies that operate in a cloud environment.