Responsibilities
- Lead production incident handling and conduct post-mortem analyses to drive system stability and continuous improvement.
- Design, deploy, monitor, and troubleshoot Kafka and Redis clusters in production environments, ensuring optimal performance, scalability, and reliability.
- Collaborate closely with development teams to ensure smooth, reliable, and automated application/system deployments.
- Manage and optimize cloud infrastructure (AWS / AliCloud) for performance, cost efficiency, and operational resilience.
- Build and enhance internal DevOps platforms, including online load-testing systems and change-management tools.
- Continuously explore and apply AI-driven insights to improve reliability, reduce alert noise, and enable intelligent decision-making across engineering operations.
- Bonus: Use LLMs and AI frameworks (OpenAI, Dify, Agno, LangChain) to automate DevOps workflows such as intelligent alert triage, root-cause analysis (RCA), and chat-based operations (ChatOps).
Requirements
- 5+ years of hands-on experience operating Kafka and Redis in large-scale production environments, and the ability to work with developers to optimize application code.
- Experience using or integrating tools such as Dify, Agno, or LangChain into operational or automation workflows.
- Proficiency in at least one programming language (Python or Go) and solid SQL skills.
- Strong hands-on experience with containerization and orchestration technologies (Docker, Kubernetes).
- Proficiency with CI/CD and automation tools such as GitHub Actions, Ansible, and Terraform.
- Bonus: Experience designing or operating AIOps systems (e.g., anomaly detection, alert correlation, auto-healing, or RCA automation).
- Bonus: Familiarity with LLM-powered DevOps automation (e.g., ChatOps assistants, AI-driven observability workflows).
