Site Reliability Engineer
About the role
Working in the Site Reliability Engineering team, you’ll help ensure the stability, resilience and scale of our services through automation, observability and infrastructure engineering. The work is varied: from helping engineering teams deploy monitoring to designing and implementing new SRE tools and techniques, our team is proactive and always involved. We are a fast-moving team operating in a growing Fintech company, supporting engineers on three continents. We use a modern DevOps and SRE tech stack (GitHub Actions, K8s, ArgoCD, Grafana, AWS, Terraform) and Agile working practices to get the job done. As a member of Zepz’s SRE team you will aim high, embrace challenges and always do what’s right, acting with integrity and building trust as you contribute to the company’s technical direction and long-term decision making.
Reporting to the SRE Manager you will:
Use code to solve problems. Configuration, infrastructure, tooling and automation: everything must be solved by writing high-quality code that performs and scales.
Apply best practices and standards for observability, monitoring, alerting, capacity planning, availability, performance/latency, change and troubleshooting across all our tech services.
Work closely with feature teams to ensure that services are correctly monitored, change is delivered in a safe and secure way, resilience is built into our product, and our standards and best practices are adopted.
Lead or be involved in the troubleshooting of complex incidents and problems.
Have end-to-end visibility of the service we provide to our customers, and ensure their journey is stable and consistent across all microservices and third-party dependencies, using the observability tooling you will have implemented with the Engineering teams.
Help the team meet its strategic goals: to maintain the highest level of observability, maximize developer velocity while keeping our product reliable, and ensure that we can deliver the highest quality experience to our customers.
Growing together. You’ll review others' work and happily seek feedback on yours to ensure we build a better codebase and sharpen each other's skills.
What we’re looking for from you
A skilled Engineer. At least 5 years in an SRE, DevOps or engineering role, with a keen interest in solving problems using automation.
Understanding of SRE and DevOps methodologies. You understand the build and deployment cycle of an application, and how to operate a resilient system.
A focus on observability. Observability is key to operating a truly reliable and scalable system. We are looking for engineers who can 'Monitor Everything & Measure Everything', driving a culture of observability. You have experience with Grafana, Loki and Prometheus.
Holistic view on application delivery. You understand how many systems (monitoring, logging, alerting and scaling) come together to build a robust platform that can respond to varying demands from both external sources (traffic) and internal sources (feature team delivery) in a safe and controlled manner. You have experience supporting or developing applications written in Java, Python or Node.js.
Systematic problem-solving approach. You should understand how to analyze and troubleshoot large-scale distributed systems.
Happy in the Clouds. Our Cloud Native platform is hosted on AWS. You’ll be comfortable working with a system that supports users from around the world, at scale.
Bias for action. You see a problem, you fix a problem. You get buy-in for your solutions and keep tickets moving. We’re always looking for ways to ship at pace.
Growth mindset. A willingness to use your skills and experience to mentor less-experienced engineers. A desire to learn from others and make yourself better every day.
Agile outlook. You need to be excited about working in a fast-changing environment. Products, tools, frameworks and processes change; we evolve and take the best bits with us. The teams drive the evolution.
Disciplined and self-managed. You need to own your role and be disciplined about adhering to protocols and processes. As a senior engineer you will always ensure you are bringing value to the team and driving tasks to completion without being actively managed.
Bonus points if you:
Have experience working in a FinTech space
Have experience working in a distributed team across different geographies and timezones