Portal Site Reliability Engineer

Job Summary

Location: Columbia, MD

Job Description

Position Description

The Portal Site Reliability Engineer (SRE) will be responsible for developing, analyzing, and using monitoring tools and data visualizations drawn from multiple sources to proactively ensure the day-to-day health of the Portal infrastructure, both in the Amazon Web Services (AWS) cloud environment and in physical data centers.

Role and Responsibilities

  • Support Portal 24/7 operations, and provide on-call support as needed
  • Identify and troubleshoot anomalies
  • Integrate new data sources to enhance the Portal infrastructure monitoring tools
  • Support the development of new visual aids and enhance existing visual aids to improve early detection of infrastructure anomalies
  • Improve the performance of existing and new monitoring tools
  • Support troubleshooting activities to pinpoint the root cause of infrastructure anomalies in a timely manner
  • Help meet high-availability SLAs through monitoring, troubleshooting, and escalation as appropriate

Required Education and Experience

  • Bachelor’s degree in computer science, information systems, or a related field, or equivalent experience
  • Proficiency in scripting with Python, Perl, or shell
  • Thorough understanding of fundamental internet protocols such as DNS, HTTP, and TCP
  • Hands-on experience in a DevOps (CI/CD) or DevSecOps environment
  • Hands-on experience with website and infrastructure monitoring tools such as Chartbeat, ExtraHop, New Relic, Dynatrace, and Splunk
  • Familiarity with common open source infrastructure tools
  • Hands-on experience with common networking equipment (Cisco, Juniper, F5, etc.)
  • Strong understanding of networking protocols
  • Experience supporting a 24x7x365, highly available, high-volume, internet-oriented production environment
  • Experience in fast-paced Agile environments using Scrum/Kanban
