RBC Borealis is seeking an experienced Machine Learning Platform Engineer to design and implement machine learning infrastructure and automation tools (MLOps and DevOps). This role involves deploying and operating GenAI platforms on Kubernetes/OpenShift, managing large language model deployments on GPU infrastructure, monitoring performance, implementing observability stacks, and building scalable on-premises systems for ML.
Deploying and operating the GenAI platform across OpenShift/Kubernetes.
Managing large language model deployments (Cohere Command, Llama, Mistral) on GPU infrastructure (NVIDIA A100/H100).
Configuring RAG pipelines with serving frameworks like vLLM, NVIDIA NIM, and TensorRT-LLM.
Monitoring GPU utilization, model performance metrics, and resource allocation.
Implementing observability stacks (Prometheus, Grafana, Pushgateway, structured logging pipelines) for platform health and security.
Designing and implementing best practices and standards for data and machine learning pipelines.
Supporting platform users and cross-functional teams through infrastructure design guidance, documentation, and collaboration.
Building highly scalable, resilient on-premises systems for hosting machine learning systems.
Strong experience designing and operating distributed/ML systems.
Deep Kubernetes/OpenShift knowledge (Helm, operators, custom resources, RBAC, troubleshooting).
Proven history building DevOps/CI/CD pipelines (GitHub Actions), multi-stage Docker images, registry mirroring, and infrastructure automation in restricted enterprise environments.
In-depth knowledge of the various stages of the machine learning application deployment process.
Proficiency with programming languages such as Python, Bash, or Rust.
Solid grasp of software engineering best practices (testing, coding standards, code reviews, source control, production monitoring, alerting).
Hands-on experience building and deploying hybrid and on-premises enterprise environments.
Familiarity with Large Language Model (LLM) inference and serving (e.g., vLLM).
37.5 hours/week
Royal Bank of Canada is a global financial institution with a purpose-driven, principles-led approach to delivering leading performance. As Canada's largest bank, it provides personal and commercial banking, wealth management, and capital markets services to over 17 million clients worldwide.