Site Reliability Engineer
Palo Alto, CA
Date Posted: 06-May-2026
Work Type: On-Site
Job Number: 484337
Job Description
Position: Site Reliability Engineer
Location: Palo Alto, CA
Duration: 9 Months
Top skills required for this role:
• Programming: Proficiency in languages like Python, Java, or Go.
• System Administration: Strong understanding of Linux/Unix systems.
• Cloud Infrastructure: Experience with AWS.
• Infrastructure as Code (IaC): Knowledge of tools like Terraform or Ansible.
• Monitoring Tools: Proficiency with tools such as Prometheus, Grafana, or Datadog.
Job Description/Responsibilities
• Automation and Tooling: Writing code to automate operational tasks such as provisioning, configuration changes, and system updates, reducing manual work and human error.
• System Monitoring and Alerting: Developing and maintaining observability stacks (logs, metrics, tracing) to proactively detect issues before they impact users.
• Incident Response and On-Call: Managing a 24/7 on-call rotation to respond to, troubleshoot, and resolve production incidents.
• Post-Incident Reviews (Postmortems): Conducting blameless, in-depth reviews of incidents to identify root causes and implement preventive measures.
• Capacity Planning: Analyzing system resource utilization to ensure infrastructure can scale to handle future load requirements.
• Performance Optimization: Identifying and fixing bottlenecks in software and infrastructure to improve system efficiency and responsiveness.
• Error Budget Management: Setting and managing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to determine whether a service is reliable enough to allow new feature deployments.
• Chaos Engineering: Testing system resilience by intentionally introducing failures to ensure systems are fault-tolerant.
Applicant Notices & Disclaimers
At SPECTRAFORCE, we are committed to maintaining a workplace that ensures fair compensation and wage transparency in adherence with all applicable state and local laws. The starting pay for this position is $59.00/hr.