Sr. Site Reliability Engineer I
Position Summary:
The Sr. Site Reliability Engineer I is a proactive, disciplined, and collaborative individual with a focus on ensuring the reliability and performance of Pax8 services. They work on enabling teams with observability solutions, supporting the development lifecycle, and maintaining robust cloud infrastructure. The Engineer is involved in developing tools and processes to enhance service stability and availability. They bring a development mindset, excellent debugging skills, and a strong focus on automation and operational excellence. Their goal is to simplify system complexity by standardizing and streamlining technical solutions, maintaining consistency in common patterns, and minimizing redundant sprawl.
Essential Responsibilities:
Increase developer velocity and system reliability by applying software development expertise, collaborating with engineering teams to address reliability concerns, and analyzing the sources of issues and their impact on cloud infrastructure, so that the engineering community can work in a reliable, scalable environment (25%)
Standardize and implement baseline visibility across systems. Leverage programmatic monitoring to proactively address visibility gaps. Collaborate with teams to embed observability in the design phase, ensuring resilient and dependable systems. (20%)
Collaborate with Architecture and Platform teams to design automated solutions that eliminate repetitive tasks, enhance self-healing capabilities, improve service reliability, and enable developers to focus on delivering product features using proven, predictable frameworks (15%)
Prioritize security by collaborating with the engineering community to implement secure solutions, address issues proactively and reactively, and use lessons learned to establish best practices that minimize disruptions to product development (15%)
Elevate team capabilities through mentorship, project work assistance, design guidance, and participation in support and on-call rotations (15%)
Participate in incident response and post-incident analysis to drive improvements in system reliability by contributing to rapid recovery, conducting root cause analysis, and implementing changes based on post-mortem findings. (10%)
Ideal Skills, Experience, and Competencies:
Five (5) to eight (8) years of experience supporting application development, preferably microservices and Java-based web platforms.
Substantial, proven software development experience.
Ability to show advanced proficiency in a relevant programming language (Java, TypeScript, Python, Groovy, Kotlin, VueJS, etc.).
Advanced experience with one or more of the following frameworks (Spring, Spring Boot, JUnit, Mockito, Kotest, Stripe, Kafka, Elasticsearch, NetSuite, OAuth).
Experience using AI within the SDLC to quickly deliver reliable solutions
Strong experience with observability platforms such as New Relic, Sumo Logic, Honeycomb, and similar tools to track performance and detect issues
Solid understanding of core AWS services, including EKS, RDS, and MSK (Azure knowledge is a plus)
Extensive experience with container technologies such as Docker and Kubernetes, with an emphasis on operational reliability
Proficient in Tomcat, Groovy, Kotlin, and Spring
Proven experience in debugging and troubleshooting applications, using both manual and automated methods
Database and SQL development experience
Understanding of IaC and configuration management using Terraform and Git
Understanding of CI/CD pipelines using GitHub Actions and ArgoCD
Experience working in a Lean/Agile environment using tools such as Jira, ClickUp, Asana or similar
Focus on meeting project commitments with predictability and urgency
Strong desire for automation
Ability to build strong customer relationships and deliver customer-centric solutions
Ability to take on new opportunities and tough challenges with a sense of urgency, high energy, and enthusiasm.
Ability to gain the confidence and trust of others through honesty, integrity, and authenticity
Ability to maneuver comfortably through complex policy, process, and people-related organizational dynamics
Ability to anticipate and adopt innovations in business-building digital and technology applications
Required Education & Certifications:
B.A./B.S. in related field or equivalent work experience
Compensation:
Qualified candidates can expect a compensation range of $125,000 to $155,000 or more depending on experience.
Expected Closing Date: 2/7/2025
#LI-Remote #LI-DS1 #BI-Remote #DICE-D