Job Description
Site Reliability Engineer (SRE)
Location: On-site – Singapore
Industry: AI Infrastructure / Deep Tech
Client: A scaling AI technology company
Language Requirement: Bilingual – English and Chinese (Mandatory)
About the Opportunity
We are representing a forward-leaning AI infrastructure company developing scalable, high-performance platforms for intelligent systems. Their products serve real-time AI applications across enterprise and commercial markets.
As part of their continued growth, they are seeking a Site Reliability Engineer with proven experience in AI development environments, Agentic AI frameworks, or AI labs within tech or commercial companies.
Role Summary
This is a hands-on, on-site role in which you will maintain and scale production-grade infrastructure, support real-time deployment of intelligent services, and ensure platform stability under high load.
Key Responsibilities
- Operate and scale container-based infrastructure (Kubernetes/Docker) in production.
- Maintain CI/CD pipelines using GitLab CI and Argo CD.
- Implement observability tools: logging, monitoring, and alerting systems.
- Automate operational tasks with Shell/Python scripts.
- Troubleshoot critical incidents, perform root cause analysis, and implement recovery strategies.
- Work with engineering teams to embed infrastructure-as-code and reliability best practices.
- Participate in 24/7 on-call rotations for system support.
Requirements
- Minimum 5 years of experience in SRE, DevOps, or platform engineering.
- Proven background in managing infrastructure for AI-driven systems or agent-based platforms.
- Strong hands-on experience with Kubernetes, Docker, and major cloud providers (AWS/GCP/Azure).
- Proficiency in Linux systems and scripting languages (Shell, Python).
- Deep understanding of services such as Nginx, Redis, Kafka, Elasticsearch, and MySQL.
- Fluent in both English and Chinese (written and spoken) – required for team collaboration and system documentation.
- Must be willing to work on-site in Singapore.
What’s Offered
- High-impact engineering role with direct access to core AI infrastructure.
- Collaboration within a multilingual, highly technical team.
- Competitive salary and performance-based incentives.
- Opportunity to shape platform reliability for next-generation AI systems.