Lead (Staff) Site Reliability Engineer, Resiliency (Remote, Pacific Time), Americas
About the role
The Resiliency team is part of the Production Engineering organization that builds, operates, and improves the heart of Shopify’s technical platform, and unlocks the power of planet-scale infrastructure for all of Shopify’s merchants, buyers, and developers.
Shopify has many critical components, and sometimes they fail. Members of our Resiliency team are the ones ensuring we can get back to green as fast as possible when that happens. Resiliency sets the foundation for building and running resilient systems at Shopify. This is a team of engineers who have in-depth operational knowledge of the entire Shopify stack and who act as first responders and leaders during an incident.
Our job is to get to a resolution as quickly as possible, and guide teams to build a more resilient Shopify. We build whatever is necessary to quickly resolve incidents, and seek out ways to automate away the manual toil.
Commerce happens 24/7, and we are building out a globally distributed team that can respond whenever necessary. Our team hires across 4 different regions (APAC, North America West, North America East, and EMEA) in a follow-the-sun support model that also provides 24/7 coverage for incident management.
For this Lead / Staff Production Engineer role, we welcome remote candidates based anywhere in the Hawaii or Pacific time zones. Working hours skew toward Hawaii Standard Time (UTC-10:00). Relocation is possible for the right candidate. 🏄🏾‍♀️
What we can offer you:
- The opportunity to run Shopify’s planet-scale systems by enabling engineering teams to create resilient systems.
- Work focusing on a unique set of interesting and challenging problems that can’t be easily found elsewhere.
- The flexibility to define what Resiliency and Site Reliability Engineering mean for Shopify.
- The means to grow the capacity of our worldwide distributed site reliability engineering teams, and consult with other engineering groups on how to build low-latency, highly resilient systems.
- A direct impact on our millions of merchants’ ability to generate revenue for their livelihood, their families, and their employees through the business they’ve built from the ground up on our platform.
- Potential relocation assistance to one of the regions the team operates in.
You’ll work on things like:
- Collaborating with high-calibre engineering teams across Shopify to help them create resilient systems.
- Acting as a force multiplier across and within engineering departments.
- Managing ongoing incidents, using your understanding of Shopify to involve the right teams and resolve them as quickly as possible.
- Cleaning up the noise in our signals, ensuring we can get an understanding of the system and debug a problem easily.
- Responding to automated alerts and executing playbooks.
- Setting standards with teams for building resilient, debuggable systems.
- Ensuring we never fail for the same reason twice.
- Following up on each meaningful incident to ensure the appropriate learnings are extracted and teams know what to do next.
- Helping teams build tools to automate the toil of on-call duties.
Qualities you likely have to be well suited to this role:
- Based in the Hawaii or Pacific time zone, and willing to work core hours in Hawaii Standard Time. Relocation is also possible for the right candidate. 🏄🏾‍♀️
- Experience handling multiple on-call shifts for mission-critical systems, and responsibility for the tools and processes used to debug and correct failures.
- You've navigated more than one incident through to the retrospective process.
- You know what good observability looks like, but more importantly, how to get there.
- Strong software engineering skills, primarily in backend software development.
- Comfort with hands-on development, navigating through multiple programming languages, digging deep in the stack, and using cloud infrastructure (AWS, GCE, Azure, Kubernetes, Docker).
- Experience with mentorship and helping teammates level up their craft and technical skills.
- You understand the meaning of continuous improvement and evolving systems.
- You reject the idea that on call has to be a terrible, disruptive experience.
- You understand how to improve difficult situations through short and iterative projects.
- A commitment to, and drive for, quality, technical excellence, and results.
- Experience working with a variety of open-source software, including nginx, Redis, Memcached, and MySQL.
- Familiarity with network and web protocols, from IP to HTTP.
We know that looking for a new role can be both exciting and time-consuming, and we truly appreciate your effort. Brad is an actual real live person (👋🏻) and is looking forward to learning more about you through your application. And remember, we want to know what you're really interested in building and why you want to build it at Shopify, so please give us as much detail on this as you'd like in the answers on the next page. 👍 📖
As there are multiple positions, this posting will remain live until all positions have been filled. Successful candidates can expect to hear back from us within 1-3 weeks of application.
Our belief is that a strong commitment to diversity & inclusion enables us to truly make commerce better for everyone. We encourage applications from Indigenous peoples, racialized people, people with disabilities, people from gender and sexually diverse communities, and/or people with intersectional identities. Please take a look at our Sustainability Reports to learn more about Shopify’s commitments to our communities, and our planet.
At Shopify, we understand that experience comes in many forms. We’re dedicated to adding new perspectives to the team - so if your experience is this close to what we’re looking for, please consider applying.