Machine Learning Operations (MLOps) Engineer

Astana, KZ

Nace.AI is a hive for people who see a tough problem and itch to solve it with AI. We mix curiosity, rigor, and play to turn raw research into tools that feel like magic.

Role Overview:

In this role, you will build and maintain the infrastructure and workflows that enable scalable, reliable, and efficient deployment of machine learning models. You will work on end-to-end model lifecycle management, from training pipelines to production deployment, monitoring, and optimization. Your efforts will ensure that Nace.AI models operate at peak performance while maintaining robustness, reproducibility, and compliance with best practices.

Key Responsibilities:

Design, implement, and maintain robust ML pipelines for model training, validation, deployment, and monitoring.
Automate end-to-end model lifecycle processes, including versioning, testing, and continuous integration/continuous deployment (CI/CD) for ML models.
Ensure reliability, scalability, and performance of deployed models in multi-node GPU/TPU environments.
Collaborate with ML engineers and researchers to streamline workflows for synthetic data generation, model training, and fine-tuning.
Implement monitoring, alerting, and logging frameworks for production ML systems to ensure high availability and performance.
Optimize resource utilization and operational efficiency across distributed ML infrastructure.

Qualifications:

Hands-on experience deploying, scaling, and maintaining machine learning models in production.
Practical experience with MLOps tools and frameworks (e.g., MLflow, Kubeflow, Airflow, TensorFlow Serving, TorchServe, Docker, Kubernetes).
Strong programming skills in Python and experience with building automation pipelines and CI/CD for ML.
Solid foundation in computer science fundamentals (data structures, algorithms, system design).
BS degree in CS, Software Engineering, or a related technical field.
Experience with cloud platforms (AWS, GCP, Azure) and distributed computing environments.
Self-starter comfortable working in fast-paced, dynamic environments.

Preferred Qualifications:

MS/PhD in CS, Software Engineering, or a related technical field.
Experience with distributed training and inference of LLMs and VLMs.
Familiarity with observability and monitoring tools (Prometheus, Grafana, Sentry).
Contributor to open-source MLOps or ML infrastructure projects.

Why Nace.AI?

Pedigree: Work with a team from top-tier institutions and companies, backed by the best VCs in the world.

Impact: You are joining early enough to shape the culture and the growth roadmap of a company aiming to be the "OS" for professional knowledge.

Competitive Package: Silicon Valley-standard salary, significant equity, and premium benefits.

Apply: send your CV and a brief note on a complex audit you led to career@nace.ai.
