Job Description
Responsibilities:
- Communicate detailed analysis and clear explanations to executive-level stakeholders.
- Drive long-term solutions for high-impact production issues across technical, operations, and product teams.
- Lead process improvement initiatives and provide feedback to management.
- Promptly respond to and resolve incidents reported by internal and external users.
- Conduct thorough analysis of issues and provide timely updates.
- Perform user maintenance and fulfill Standard Service Requests.
- Conduct routine testing and monitoring, and respond to system issues.
- Keep tickets updated, monitor their progress, and coordinate issue resolution.
- Identify and escalate urgent issues, and participate in incident and problem management.
- Maintain support documentation and ensure compliance with standard practices.
- Manage client-facing applications and implement production system alerts.
- Support continuous improvement and lead automation projects.
- Monitor service-level dashboards and perform system capacity reviews.
- Participate in Root Cause Analysis (RCA) investigations and Site Reliability Engineering (SRE) tasks.
- Generate daily reports on ticket status and share knowledge with peers.
Qualifications:
- Bachelor’s degree in engineering, computer science, or related field.
- Hands-on experience in programming and web services.
- Proficiency in monitoring, observability, and data processing techniques.
- Experience with cloud platforms (GCP, AWS, Azure) and application monitoring tools (Datadog, LogicMonitor, PagerDuty).
- Ability to create and analyze Datadog dashboards.
- Exceptional communication skills.
- Flexibility to work in a 24×7 shift rotation, including weekends.
- Application support experience in Microsoft Azure Cloud environment.
- Proficiency in Microsoft Office tools, SQL, Unix, and Analytic Reporting Suite.
- Experience in a Global Command Center environment and familiarity with ITIL and ITSM practices.