Support Engineer - Incident Management, AWS Incident Response (AIR), AWS Incident Response

Details of the offer

Support Engineer - Incident Management, AWS Incident Response (AIR)
Job ID: 2833550 | Amazon Development Centre Ireland Limited
AWS Incident Response is at the heart of high availability of Amazon Web Services. We make customer-impacting events shorter and less frequent by providing large-scale event and incident management. Our automated tooling quickly identifies the cause of an issue and helps mitigate its impact, and much of our engineers' time is spent on projects to improve the tooling and automation. We also provide manual incident management for AWS and other Amazon groups, directing the resolution of an issue with service teams, and diving deep into those events to drive improvements to the tooling. It's an exciting time to join our team as we are rapidly growing and expanding our offerings.
As a Support Engineer on the team you will lead projects and build processes to reduce the duration, frequency, and impact of issues within the AWS and Amazon infrastructure. You will also spend a portion of your time directing the resolution of high-visibility incidents by leading conference calls and teams across the globe. Using data learned from those incidents you will drive further improvements into our automation, tooling, and processes so that the next event is shorter or avoided entirely. You will participate in project teams to expand use of our tooling to additional areas across Amazon. You'll also have the opportunity to grow your coding skills by taking on development projects matched to your ability level. If you're looking for a supportive team with great growth potential and an opportunity to make a huge impact, this is the team to join.

Key job responsibilities

Critical Issue Resolution and Call Management: Act as the primary point of contact in a team rotation for customer-impacting issues. Monitor performance graphs, drive resolution calls with a large number of service team members, and page additional engineers as needed until the root cause is identified. Please note this could include some weekends and holidays.
Root Cause Analysis and Prevention: Identify and analyse recurring platform issues, leading projects to address root causes and implement long-term preventative measures.
Automation and Efficiency Projects: Apply scripting and automation skills to projects that improve team efficiency and operational excellence, reducing manual work and streamlining incident resolution processes.
Documentation and SOP Development: Design, create, and review documentation, including new standard operating procedures, to improve knowledge sharing and incident response speed.
Mentorship and Knowledge Sharing: Provide mentorship to peers in technical troubleshooting and incident management best practices.
Global Project Leadership: Lead cross-functional, global project teams to implement operational improvements and automation initiatives.

BASIC QUALIFICATIONS

Technical Troubleshooting and Debugging: Proven experience in troubleshooting and resolving complex technical systems issues.
Analytical Documentation Skills: Experience in documenting technical findings and analysis.
Scripting Knowledge: Practical programming ability with at least one scripting language (e.g., Python, Shell Script, PowerShell, Ruby, etc.) to automate routine tasks and improve efficiency.
Technical Support Background: 3+ years of experience in technical support, incident response, or a related field.

PREFERRED QUALIFICATIONS

Advanced Monitoring and Observability Skills: Experience with monitoring tools (e.g., CloudWatch, Datadog, Prometheus) for proactively identifying and resolving performance issues.
Expertise in Incident Management and Call Facilitation: Demonstrated experience managing high-stakes, multi-participant incident calls, with the ability to communicate clearly and organize on-call team members effectively.
CI/CD and Process Automation: Familiarity with CI/CD pipelines and automation best practices to continuously improve the team's deployment and incident management workflows.
Collaboration and Cross-Team Communication: Strong skills in collaborating across technical teams, documenting incidents, and sharing findings with both technical and non-technical stakeholders to foster operational transparency.

Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build.
Posted: November 12, 2024 (Updated about 17 hours ago)



Nominal Salary: To be agreed

Source: Jobleads

