A production engineer is a developer who thinks deeply about systems and how they behave in the wild, whether the focus is networking, the Linux kernel, or a specific interest in scaling, algorithms, or distributed systems. You are a systems engineer who aims to code yourself out of a job by automating all the things, and who leverages great development practices such as test-driven development and continuous integration (to start with).
As with all engineers at Benzinga, we expect you to be comfortable operating within our application and service environments. Your primary focus will be on defining, building, and maintaining our robust and scalable infrastructure, and you will collaborate closely with development teams to ensure seamless integration and deployment. While there's plenty of work for you to do upfront on our infrastructure, your true impact at Benzinga is the creation of a reliable, high-performing platform that supports the amazing products our users come to know and love.
Radiate knowledge about the service's infrastructure and reliability to the rest of the development team.
Identify parts of the system that do not scale, provide immediate palliative measures, and drive long-term resolution of the underlying problems.
Plan the growth of Benzinga's infrastructure.
Development/Deployment Responsibilities
Document every action so your learnings turn into repeatable actions and then into automation.
Improve the deployment process to make it as boring as possible.
Define, provision, and manage our production infrastructure using Kubernetes and Terraform.
Security Responsibilities
Proactively identify and reduce security risks.
Develop security training and guidance for internal development teams.
Ability to discover and patch SQLi, XSS, CSRF, SSRF, authentication and authorization flaws, and other web-based security vulnerabilities
Knowledge of common authentication technologies including OAuth, SAML, CAs, OTP/TOTP
Adhere to SOC2 compliance standards and assist with ongoing auditing and reporting.
Production Responsibilities
Design, build and maintain core infrastructure pieces that allow Benzinga to scale to support thousands of concurrent users.
Participate in an on-call rotation to respond to benzinga.com (http://benzinga.com/) availability incidents and support service engineers with customer incidents.
Debug production issues across services and levels of the stack.
Monitoring Responsibilities
Ensure monitoring and alerting trigger on symptoms and not on outages.
Manage day-to-day maintenance and evolution of Benzinga's Prometheus monitoring and alerting infrastructure
Bundle Prometheus monitoring as an out-of-the-box monitoring solution for Benzinga products
Build and maintain the benzinga.com (http://benzinga.com/) public monitoring gateway
Help migrate our current performance monitoring solution to Prometheus
Improve coverage of Benzinga performance monitoring
Create automated alerts to notify team members of regressions
Requirements
Experience with Kubernetes required
Experience with some of these technologies is a must: EKS, Terraform, GitLab CI/CD, OpenSearch/Elasticsearch, Postgres, MySQL, Kafka, BigQuery, Python, Node.js, Go, Java, Prometheus, Grafana, Coralogix, Varnish, Fastly, CloudFront, Nginx, Kong
You can reason about software, algorithms, and performance from a high level.
You have experience thinking about systems - edge cases, failure modes, behaviors, and specific implementations.
You have worked with distributed systems and have a solid understanding of how modern web stacks are built, and why.
You know your way around Linux and the Unix Shell.
Strong communication skills
Experience with managing large amounts of telemetry
Experience developing time-series databases
Self-motivated with strong organizational skills