Posted 3 months ago

To apply for this job, please visit www.binance.com.

Lead production incident handling and conduct post-mortem analyses to drive system stability and continuous improvement.
Design, deploy, monitor, and troubleshoot Kafka and Redis clusters in production environments, ensuring optimal performance, scalability, and reliability.
Collaborate closely with development teams to ensure smooth, reliable, and automated application/system deployments.
Manage and optimize cloud infrastructure (AWS/AliCloud) for performance, cost efficiency, and operational resilience.
Build and enhance internal DevOps platforms, including online load-testing systems and change-management tools.
Continuously explore and apply AI-driven insights to improve reliability, reduce alert noise, and enable intelligent decision-making across engineering operations.
Bonus: Utilize LLMs and AI frameworks (OpenAI, Dify, Agno, LangChain) to automate DevOps workflows such as intelligent alert triage, root-cause analysis (RCA), and chat-based operations (ChatOps).

5+ years of hands-on experience operating Kafka and Redis in large-scale production environments, with the ability to work with developers to optimize application code.
Experience using or integrating tools such as Dify, Agno, or LangChain into operational or automation workflows.
Proficiency in at least one programming language (Python or Go) and solid SQL skills.
Strong hands-on experience with containerization and orchestration technologies (Docker, Kubernetes).
Proficiency with CI/CD and automation tools such as GitHub Actions, Ansible, and Terraform.
Bonus: Experience designing or operating AIOps systems (e.g., anomaly detection, alert correlation, auto-healing, or RCA automation).
Bonus: Familiarity with LLM-powered DevOps automation (e.g., ChatOps assistants, AI-driven observability workflows).