Job#: 3025629
Job Description:
Role: Cloud Infrastructure / SRE
Duration: Multi-year contract
Location: Hybrid - 4 days/week onsite in SE MI
Description:
Our Platform Engineering team builds and operates shared infrastructure and paved paths that enable product teams to deliver software securely, reliably, and quickly. This role is focused on cloud infrastructure, DevOps, and Site Reliability Engineering (SRE), with a strong emphasis on software development and automation.
You will help design, build, and operate the core platform capabilities that power multiple teams, treating reliability, security, and developer experience as first-class features.
What You’ll Do
- Design, build, and operate cloud infrastructure and platform capabilities, including networking, compute, Kubernetes, CI/CD, secrets, certificates, and identity
- Define and improve system reliability using service-level indicators (SLIs), service-level objectives (SLOs), and error budgets
- Implement observability (metrics, logs, traces) with actionable alerting focused on user and business impact
- Build self-service workflows and automation using infrastructure as code, GitOps, and modern build/release pipelines to reduce operational toil
- Improve security and compliance through least-privilege access, secure-by-default patterns, policy-as-code, and continuous hardening
- Participate in on-call rotation, incident response, and post-incident reviews; drive systemic fixes and improve runbook quality
- Partner with application teams to improve deployability, resilience, and cost efficiency through capacity planning, autoscaling, and graceful degradation
Required Qualifications
- Hands-on experience operating production cloud platforms (GCP, AWS, or Azure) with an SRE mindset
- Strong fundamentals in Linux, networking, distributed systems, and debugging complex production issues
- Proficiency with infrastructure as code and automation (e.g., Terraform, Helm/Kustomize, GitOps tooling)
- Experience with containers and orchestration (Docker, Kubernetes) and modern CI/CD pipelines
- Programming and scripting experience (e.g., Python, Go, Java, TypeScript) to build tools and automate workflows
- Strong communication skills, effective incident leadership, and a customer-focused approach to platform engineering
Experience & Education
- Experience Level: Engineer 3
- 6+ years overall IT experience
- 4+ years in software development
- Practical experience in at least two programming languages, or advanced expertise in one
Primary Skill Expectations (Expanded)
- Cloud Infrastructure: Proven experience designing and operating production-grade cloud infrastructure, including networking, IAM, compute, and managed services, with a clear understanding of the tradeoffs involved
- Python: Experience building maintainable, production-grade tooling or automation (testable, error-tolerant, and team-owned)
- GCP: Hands-on operation of GCP services in a platform context, including workload identity, policy enforcement, secret management, and security controls
- Platform Support: Experience supporting internal developer platforms, including on-call ownership, incident response, blameless postmortems, and preventative engineering improvements
- Kubernetes: Production experience operating Kubernetes clusters, including upgrades, RBAC, networking, autoscaling, and deep troubleshooting
EEO Employer
Apex Systems is an equal opportunity employer. We do not discriminate or allow discrimination on the basis of race, color, religion, creed, sex (including pregnancy, childbirth, breastfeeding, or related medical conditions), age, sexual orientation, gender identity, national origin, ancestry, citizenship, genetic information, registered domestic partner status, marital status, disability, status as a crime victim, protected veteran status, political affiliation, union membership, or any other characteristic protected by law. Apex will consider qualified applicants with criminal histories in a manner consistent with the requirements of applicable law.