Site Reliability Engineer / Senior Site Reliability Engineer, Reliability | MLOps Remote, EMEA
The GitLab DevOps platform empowers 100,000+ organizations to deliver software faster and more efficiently. We are one of the world’s largest all-remote companies with 2,000+ team members and values that guide a culture where people embrace the belief that everyone can contribute.
Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments and the GitLab codebase. We specialize in systems, whether it be networking, the Linux kernel, or some more specific interest in scaling, algorithms, or distributed systems.
GitLab.com is a unique site and it brings unique challenges: it’s the biggest GitLab instance in existence. In fact, it’s one of the largest single-tenancy open-source SaaS sites on the internet. The experience of our team feeds back into other engineering groups within the company, as well as to GitLab customers running self-managed installations.
- Automating every operational task is a core requirement for this role: for example, package updates, configuration changes across all environments, and creating tools for automatic provisioning of user-facing services.
- Responding to platform emergencies, alerts, and escalations from Customer Support.
- Ensure systems exist to manage software life-cycles (e.g. Operating Systems) with a minimum of manual effort.
- Develop a fully automated multi-environment observability stack based on the existing SaaS system, and extend it to predict capacity needs based on the usage patterns.
- Plan for new service roll-outs, expansion and capacity management of existing services, and work with users to optimise their resource consumption.
As an SRE you will:
- Be on a PagerDuty rotation to respond to GitLab.com availability incidents and provide support for service engineers with customer incidents.
- Analyze existing GitLab.com Service Level Objectives (SLOs), and create and maintain new ones.
- Troubleshoot, evaluate, and resolve operational challenges contributing to defined SLOs.
- Identify architectural bottlenecks in the application as observed on GitLab.com, and help define and implement improvements.
- Work with other engineering stakeholders on resolving larger architectural bottlenecks, and participate by offering the GitLab.com point of view.
- Test, deploy, maintain, and improve ML infrastructure and the software that uses ML models.
- Work in close collaboration with software development teams to shape the future roadmap and establish strong operational readiness across teams.
- Scale systems through automation, improving change velocity and reliability.
- Leverage technical skills to partner with team members and be comfortable diving into a problem as needed.
- Work with counterparts in other teams of the Infrastructure department to improve infrastructure running with Chef, Terraform and Kubernetes.
- Make monitoring and alerting fire on symptoms, not on outages.
- Document every action so your findings turn into repeatable actions, and then into automation.
- Debug production issues across services and levels of the stack.
You may be a fit for this role if you:
- Are able to reason about large systems: how they work at large scale, their edge cases, failure modes, and behaviors.
- Know your way around Linux and the Unix Shell.
- Have experience in collaborating and communicating asynchronously.
- Have significant professional experience in Python backend infrastructure.
- Have experience working with PyTorch, TensorFlow (TF) infrastructure, and other similar frameworks.
- Help improve the scalability and maintainability of ML features.
- Collaborate with other ML engineers and advise on the MLOps architecture from an infrastructure perspective.
- Aid in integrating every ML feature with GitLab.
- Have a high interest in defining infrastructure for large-scale ML recommendation engines (experience with this, however, is a nice-to-have).
- Are comfortable working in the earlier stages of product development.
- Have a genuine passion for learning.
- Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies.
- Are able to leverage GitLab as your day-to-day go-to tool.
Nice-to-have attributes:
- Research or industry experience in ML engineering.
- Experience with Kubernetes and MLflow, Kubeflow, or a similar MLOps stack.
- Experience with cloud architecture optimization (GDF, PubSub, GCP).
- Experience with continuous training of models.
About the group: ModelOps
The ModelOps Stage comprises two groups: Applied ML and MLOps. We have recently released the beta version of our first Applied ML feature, Suggested Reviewer. We are working on our first feature as part of MLOps, Model Registry. We will be expanding our use cases in the future.
This team primarily writes code in Python, Golang, Ruby on Rails, and Vue.js.
Senior Site Reliability Engineer Criteria
- Deep knowledge in 2 areas of expertise and general knowledge of all areas of expertise. Capable of mentoring Junior SREs in all areas and other SREs in their area of deep knowledge.
- Contributes small improvements to the GitLab codebase to resolve issues.
- Identifies significant projects that result in substantial cost savings or revenue.
- Identifies changes for the product architecture from the reliability, performance, and availability perspective with a data-driven approach.
- Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, making GitLab cheaper to run for all our customers.
- Identifies parts of the system that do not scale, provides immediate palliative measures, and drives long-term resolution of these incidents.
- Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.
Collaboration and Communication:
- Know a domain really well and radiate that knowledge.
- Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.
Influence and Maturity:
- Lead Production SREs and Junior Production SREs by setting the example.
- Show ownership of a major part of the infrastructure.
- Trusted to de-escalate conflicts inside the team
Site Reliability Engineers have the following job-family performance indicators:
- GitLab.com Availability
- GitLab.com Performance
- Apdex and Error SLO per Service
- Mean Time to Detection
- Mean Time to Resolution
- Mean Time Between Failure
- Mean Time to Production
- Disaster Recovery Time to Recovery
Please view the compensation range for this role at the bottom of the position description.
The base salary range for this role’s listed level is currently $100,800 - $183,600 for Colorado residents and $100,800 - $205,200 for New York and New Jersey residents only. Grade level and salary ranges are determined through interviews and a review of education, experience, knowledge, skills, abilities of the applicant, equity with other team members, and alignment with market data. See more information on our benefits and equity. Sales roles are also eligible for incentive pay targeted at up to 100% of the offered base salary.
Country Hiring Guidelines: GitLab hires new team members in countries around the world. All of our roles are remote, however some roles may carry specific location-based eligibility requirements. Our Talent Acquisition team can help answer any questions about location after starting the recruiting process.
GitLab is proud to be an equal opportunity workplace and is an affirmative action employer. GitLab’s policies and practices relating to recruitment, employment, career development and advancement, promotion, and retirement are based solely on merit, regardless of race, color, religion, ancestry, sex (including pregnancy, lactation, sexual orientation, gender identity, or gender expression), national origin, age, citizenship, marital status, mental or physical disability, genetic information (including family medical history), discharge status from the military, protected veteran status (which includes disabled veterans, recently separated veterans, active duty wartime or campaign badge veterans, and Armed Forces service medal veterans), or any other basis protected by law. GitLab will not tolerate discrimination or harassment based on any of these characteristics. See also GitLab’s EEO Policy and EEO is the Law. If you have a disability or special need that requires accommodation, please let us know during the recruiting process.
Vacancy page: https://boards.greenhouse.io/gitlab/jobs/6496686002