Principal responsibilities:
-Design, implement, and maintain highly available, scalable infrastructure solutions, leveraging automation to streamline operations.
-Monitor system performance, proactively identify potential issues, and drive incident response and root cause analysis.
-Collaborate with cross-functional teams (development, product, security) to integrate reliability best practices across the entire software lifecycle.
-Develop and manage automation scripts, CI/CD pipelines, and infrastructure-as-code (IaC) frameworks to enhance efficiency and reduce manual intervention.
-Optimize cloud resources, cost management, and disaster recovery strategies to ensure business continuity.
Qualifications:
-Experience: Minimum 5 years in IT operations or Site Reliability Engineering, with a focus on infrastructure management and system optimization.
-Technical Skills: Proficiency in operations tools such as Ansible, Puppet, Chef, Terraform, Prometheus, Grafana, and the ELK Stack.
-Strong scripting skills in Python, Shell, or similar languages.
-Cloud Competency: Solid experience with major cloud platforms (AWS, Azure, GCP), including services and technologies like EC2, Lambda, Kubernetes, and containerization.
-Problem-Solving: Proven ability to troubleshoot complex issues across distributed systems, networks, and applications.
-Communication: Excellent written and verbal communication skills, with the ability to collaborate effectively in a fast-paced, dynamic environment.
Preferred Qualifications:
-3+ years of dedicated experience in cloud service operations, with expertise in cloud-native architectures and microservices.
-Certifications such as AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, or equivalent.
-Experience with service mesh technologies (e.g., Istio) and observability tools (e.g., Jaeger).
-Familiarity with DevOps culture and practices, including agile methodologies and continuous improvement frameworks.
-Bonus: Proven experience in developing IT operations and maintenance tools using Python, demonstrating the ability to automate complex workflows and solve real-world problems.
Updated on 2025-12-16
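Responsibilities:
1. Develop, self-test, and maintain the adaptation of the large-model platform's infrastructure (including but not limited to systems, storage, and communication) to domestically produced hardware and software, and track the evolution of related components and industry technologies;
2. Handle continuous integration, deployment and delivery, and on-site issue diagnosis and resolution for the large-model platform products;
3. Develop and maintain the k8s-based foundation, improving system stability, intelligent operations, and system observability;
4. Optimize CI/CD processes and related tools, and produce design documents and development and operations guides.
Qualifications:
1. Bachelor's degree or above in computer science, software engineering, or a related field;
2. Familiarity with container and cluster management technologies such as Docker and Kubernetes;
3. Proficiency in at least one programming language such as Python or Golang, and experience developing large-scale distributed projects or products;
4. Experience developing k8s operators, or adapting and operating domestically produced accelerator cards, is preferred;
5. Strong teamwork and problem-solving skills; enthusiastic, responsible, and self-motivated.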
Updated on 2026-01-17