Observability Reliability Engineer
We are seeking an experienced Site Reliability Engineer (SRE) specializing in observability to join our team. This role is responsible for designing and implementing observability solutions that keep our infrastructure reliable, performant, and scalable. The role also involves reviewing our current observability stack, planning future enhancements, implementing new solutions, and collaborating with developers to create actionable insights through effective dashboards and automated alerting systems. The ideal candidate will have a strong background in analytics and experience with advanced monitoring techniques to help us achieve metrics baselining, anomaly detection, and enhanced correlation and causation analysis.
Responsibilities
Conduct thorough reviews of our existing observability stack to identify areas for improvement and optimization
Collaborate with the team to plan and design the next version of our observability infrastructure
Assist in the implementation of the new observability stack, ensuring seamless integration and minimal disruption
Create and maintain insightful and actionable dashboards that provide clear visibility into system performance without adding unnecessary noise
Review existing alerts and work closely with developers to automate alert handlers for self-healing systems
Utilize your experience in analytics to perform metrics baselining and anomaly detection, ensuring our systems are operating optimally
Explore and integrate AI tools to enhance our correlation and causation analysis capabilities
Develop and maintain necessary components such as metrics exporters and self-service tools
Required Skills
Demonstrated experience as a Site Reliability Engineer, Observability Engineer, or in a similar role in software development
Experience with observability, including implementing monitoring, alerting, and dashboarding solutions
Experience with alert management and automation
Experience with custom metrics exporters and tracing tools
Experience with performance analysis tools and optimization
Hands-on experience with the Prometheus ecosystem
Ability to design and develop code in Python or Go
Strong drive to automate manual operations and processes
Strong understanding of Linux operating systems
Hands-on experience with configuration management or infrastructure-as-code tools such as Ansible, SaltStack, or Terraform
Experience in managing and scaling distributed systems
Strong sense of ownership and integrity, demonstrated through clear communication and collaboration
Excellent troubleshooting and problem-solving skills
Ability to communicate complex concepts clearly to both stakeholders and developers