Blockdaemon is looking for a Site Reliability Engineer (SRE) to join our rapidly growing team and support our mission to connect institutions to blockchains through a single integration. The Site Reliability Engineer will work with all facets of the business to help streamline and scale our infrastructure. In this role, you will serve as a subject matter expert in network architecture and design implementation, working closely with engineers from all parts of the business to grow Blockdaemon to meet the requirements of the Web3 ecosystem.
- Become an internal support system and leader for operational health and incident response
- Partner with and support the overall engineering organization and elevate incident management
- Review and operationalize SLOs, SLIs, and SLAs for maximum efficiency
- Design, implement, and troubleshoot services that support the cloud infrastructure we use to manage and support our nodes
- Improve our infrastructure capabilities, optimizing for cost, simplicity, and maintainability
- Apply continuous integration/continuous delivery (CI/CD) practices using the latest DevOps tools and innovative methods
- Build strong and highly functional partnerships with product and other technology teams
- Support senior engineers through outages and incidents for a business requiring 24x7 coverage
- Build automations and self-service tooling with a security-conscious mindset
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating processes to continually improve
- Troubleshoot issues affecting reliability, resiliency, scalability, and availability
- Assist with on-call and triage rotations
- Remove barriers to building and shipping products across bare metal and cloud service providers
- 5+ years of experience in DevOps, Site Reliability Engineering, or Production Engineering
- Experience running a mission-critical service at scale
- Prior experience running critical production systems in a Linux environment
- Passion for ensuring that everything is observed and monitored end to end
- Deep knowledge of distributed system design and operation
- Solid understanding of web and network protocols and standards (HTTP, TLS, DNS, etc.)
- Experience writing automation tools and eagerness to "automate all the things"
- Experience building large applications from scratch, complete with CI/CD infrastructure
- Experience with at least one of the major cloud providers (Amazon Web Services, Google Cloud, Microsoft Azure)
- Experience managing Kubernetes clusters or some other container orchestration infrastructure
- Experience with common infrastructure tools such as Kubernetes, Docker, Terraform, Ansible, Consul, Packer, Puppet, and Helm
- Strong sense of ownership, entrepreneurial spirit, and/or startup-like experience, capable of driving towards solutions independently while seeking feedback when appropriate
- Knowledge of at least one (1) scripting language