Responsibilities

Lead production incident response and conduct thorough post-mortems to continually improve system stability and reliability.
Partner closely with development teams to ensure seamless application and infrastructure deployments.
Maintain, optimize, and scale cloud environments (e.g., AWS, Alibaba Cloud) for performance, cost efficiency, and high availability.
Administer Kubernetes clusters (primarily EKS) to support scalable web service deployments.
Design, implement, and maintain CI/CD pipelines using tools such as GitHub Actions, Argo CD, and AI/LLM-powered automation.
Automate infrastructure provisioning and lifecycle management using Terraform, Python, and related tooling.

Requirements

At least 5 years of hands-on experience with AWS services (e.g., CloudFront, EKS, VPC, S3, ALB) and infrastructure-as-code tools such as Terraform or CloudFormation.
Strong practical expertise in Kubernetes, particularly managed services like EKS.
Proficiency in scripting languages such as Python, Shell/Bash, or Go.
Experience operating and optimizing systems for high-traffic, high-volume environments.
Familiarity with monitoring, logging, and performance tuning of distributed systems.
Hands-on experience with DevOps tools and platforms, including Terraform, Ansible, Docker, and Linux.
Solid understanding of networking protocols, architectures, and security best practices.
Strong analytical, problem-solving, and troubleshooting skills.
Bilingual proficiency in English and Mandarin to coordinate effectively with global teams and stakeholders.

Practical experience applying AI/LLMs or automation tools to real-world operational challenges and deployment workflows.
A passion for exploring emerging technologies and integrating them into daily operations to drive innovation and continuous improvement.