Responsibilities
- Lead the management of production incidents and conduct thorough post-mortems to continually enhance system stability and reliability.
- Partner closely with development teams to ensure seamless application and infrastructure deployments.
- Maintain, optimize, and scale cloud environments (e.g., AWS, Alicloud) for performance, cost efficiency, and high availability.
- Administer and manage Kubernetes clusters, primarily EKS, to support scalable web service deployments.
- Design, implement, and maintain CI/CD pipelines using tools such as GitHub Actions, ArgoCD, and AI/LLM-powered automation.
- Automate infrastructure provisioning and lifecycle management using Terraform, Python, and related tooling.
Requirements
- At least 5 years of hands-on experience with AWS services (e.g., CloudFront, EKS, VPC, S3, ALB) and infrastructure-as-code tools such as Terraform or CloudFormation.
- Strong practical expertise in Kubernetes, particularly managed services like EKS.
- Proficiency in scripting languages such as Python, Shell/Bash, or Go.
- Experience operating and optimizing systems for high-traffic, high-volume environments.
- Familiarity with monitoring, logging, and performance tuning of distributed systems.
- Hands-on experience with DevOps tools and platforms, including Terraform, Ansible, Docker, and Linux.
- Solid understanding of networking protocols, architectures, and security best practices.
- Strong analytical, problem-solving, and troubleshooting skills.
- Bilingual proficiency in English and Mandarin to coordinate effectively with global teams and stakeholders.
Preferred
- Practical experience applying AI models, LLMs, or automation tools to real-world operational challenges and deployment workflows.
- A passion for exploring emerging technologies and integrating them into daily operations to drive innovation and continuous improvement.
