|Location|Remote (U.S. Work Permit)|
|Starts|As soon as possible|
We are seeking an Ops Engineer who can bridge the gap between our global development and operations teams and who is motivated to help us continue automating and scaling our infrastructure. You would be responsible for setting up and managing project development and test environments, as well as the software configuration management processes for the entire application development lifecycle. Your role would be to ensure the optimal availability, latency, scalability, and performance of our product platforms. You would also be responsible for automating production operations, promptly notifying backend engineers of platform issues, and tracking long-term quality metrics.
Our infrastructure is based on AWS with a mix of managed services like RDS, ElastiCache, and SQS, as well as hundreds of EC2 instances managed with Ansible and Terraform. We are actively using three AWS regions, and have equipment in several data centers across the world.
- Training, mentoring, and lending expertise to coworkers on operational and security best practices.
- Reviewing GitHub Pull Requests and providing feedback to both team members and development teams; a significant percentage of our Software Engineers have written Terraform.
- Identifying opportunities for technical and process improvement and owning the implementation.
- Championing the concepts of immutable containers, Infrastructure as Code, stateless applications, and software observability throughout the organization.
- Systems performance tuning with a focus on high availability and scalability.
- Building tools to ease the usability and automation of processes
- Keeping products up and operating at full capacity
- Assisting with migration processes as well as backup and replication mechanisms
- Working in a large-scale distributed environment with a focus on scalability, reliability, and performance
- Ensuring proper monitoring and alerting are configured
- Investigating incidents and performance lapses
- Extending our compute clusters to support low-latency, on-demand job execution
- Cross-region replication of systems and corresponding data to support low-latency access
- Adding application performance monitoring to existing services, extending integrations where required
- Migration from self-hosted ELK to a SaaS stack
- Continuous improvement of CI/CD processes, making builds and deployments faster, safer, and more consistent
- Extending a global VPN WAN to a data center with IPsec and BGP
- 3+ years of DevOps and/or Operations experience
- 1+ years of production environment experience with Amazon Web Services (AWS)
- 1+ years using SQL databases (MySQL, Oracle, Postgres)
- Scripting ability (Bash, Python; C++ a plus)
- Strong experience with CI/CD processes (Jenkins, Ansible) and automated configuration tools (Puppet, Chef, Ansible)
- Experience with container orchestration (AWS ECS, Kubernetes, Marathon/Mesos)
- Ability to work as part of a highly collaborative team
- Understanding of monitoring tools like Datadog
- Experience working with Kubernetes on bare-metal and/or the AWS Elastic Kubernetes Service
- Experience with RabbitMQ, MongoDB, or Apache Kafka
- Experience with Presto or Apache Spark
- Familiarity with computation orchestration tools such as HTCondor, Apache Airflow, or Argo
- Understanding of network concepts: OSI layers, firewalls, DNS, split-horizon DNS, VPN, routing, BGP, etc.
- A deep understanding of AWS IAM and how it interacts with S3 buckets
- Experience with SAFe
- Strong programming skills in 2+ languages