The Principal Service Reliability Designer analyzes and specifies solutions for assigned projects and ensures their feasibility. The position requires close collaboration with PLEX teams, IT units, and business clients, and includes support for functional tasks and ongoing project changes. The ultimate goal is to contribute to the financial stability of millions of people in Quebec by fostering a more sustainable future.
Design scalable and resilient SRE solutions on AWS infrastructure.
Define and implement strategies tailored to business needs.
Propose architectural enhancements to optimize costs, improve performance, and ensure high availability.
Collaborate with stakeholders to align SRE priorities with business objectives.
Contribute to the development of the technology roadmap and influence strategic operational decisions.
Train and mentor development and operations teams to integrate SRE practices into daily workflows.
Advocate for a culture of continuous improvement and interdisciplinary collaboration.
Develop and maintain CI/CD tools and pipelines to automate deployments and operations.
Automate infrastructure management and repetitive operational tasks.
Identify and resolve reliability, latency, and scalability issues within AWS environments.
Implement chaos engineering practices to test system resilience.
Oversee SLOs, SLIs, and SLAs to ensure service levels meet expectations.
Establish robust incident management processes and conduct post-mortems to document root causes.
Ensure diligent follow-up on corrective and preventative actions.
Bachelor's degree in Information Technology, Software Engineering, or a related field (or equivalent experience).
Five (5) years of relevant experience in operations management.
Five (5) years of relevant experience with SRE and key AWS services (EC2, S3, RDS, Lambda, CloudWatch, Route 53, etc.).
Mastery of cloud architecture concepts: VPC, IAM, networking, security, etc.
Advanced knowledge of IaC concepts and tools (CloudFormation, Terraform, etc.).
Ability to automate and manage infrastructure in complex environments.
Expertise with tools like Datadog, CloudWatch, and ITOM SNOW.
Understanding of distributed tracing and centralized logging concepts.
Experience using Azure DevOps, GitHub, Jenkins, GitLab CI/CD, or similar tools.
In-depth knowledge of at least one (1) scripting language (Python, Bash) and one programming language (Go, Java, etc.).
Experience managing critical incidents in production environments.
Thorough understanding of chaos engineering practices.
Ability to work in an Agile context.
Bachelor's Degree
May occasionally be required to respond to critical incidents in the evening, at night, or on weekends.
CDPQ is a global investment group that manages funds for public pension and insurance plans. It invests in major financial markets, private equity, infrastructure, real estate, and private debt to generate long-term value for its depositors and the Quebec economy.