📁 Information Technology
💼 NE-NERSC
📅 Requisition #: 92829

Berkeley Lab’s National Energy Research Scientific Computing Center (NERSC) has an opening for a Site Reliability Engineer within the Operations area. The Operations team manages the NERSC HPC Data Center to ensure resources are available to 7,000 users worldwide on a 24x7 basis. The team also manages a data warehouse and notification infrastructure that must remain available to continuously collect or queue data from heterogeneous data sources throughout the NERSC computational facility.

 

In this shift-based role, you will provide a variety of engineering support services in a 24x7 environment for the primary scientific computational facility for the Office of Science in the US Department of Energy (DOE) to ensure that NERSC is accessible, reliable, secure, and available to our scientific users. Additionally, you will work with teams to provide solutions on the ServiceNow platform and to implement and deliver applications and integrations on open source platforms in a fast-paced, agile, project-based environment.

 

What You Will Do:

 

Management of the Data Center:

  • Work 5 shifts per week to manage the NERSC HPC Facility. Some shifts may be onsite and some offsite; the schedule will be determined by staffing needs.
  • Review and respond to alerts from computer systems, storage, network, and other data center and facility-related systems by triaging or calling the appropriate on-call staff.
  • Respond to alerts from the OMNI cluster (data warehouse) to ensure that the system continues to collect data 24x7 and provide real-time information for diagnosis.
  • Management of the NOW platform: Develop solutions to address general updates and configuration changes/requests.
  • Data Analysis and Visualization: Use Kibana and Grafana to analyze and diagnose the health of HPC systems using plots and data analysis. 
  • Create new plots and alerting schemes as new data sets become available.

 

What is Required:

  • Bachelor’s Degree in Computer Science or a similar discipline and 8 years of relevant experience, or an equivalent combination of work experience, education, and certifications.
  • Hands-on experience as a Linux (or similar type of operating system) system administrator or system engineer in a customer-facing environment supporting data clusters, managing the replacement of hardware, and ensuring its continued availability to the user community. This can include assisting in the deployment of new nodes and internal switches into production, resolving ticket incidents, and working with vendors on hardware warranty replacements. 
  • Hands-on application software development experience in the NOW framework or a similar platform. Must understand ITOM processes such as Incident Management, Change Management, and Problem Management within the NOW framework.
  • Demonstrated experience in a UNIX or Linux environment, with an understanding of systems, storage, and network administration sufficient to respond to data center facility issues and to alerts from the systems mentioned above.
  • Demonstrated experience as a site reliability engineer or similar position with demonstrated skills in the following:
    • container management like Kubernetes.
    • virtualization technologies like oVirt.
    • systems monitoring software like Prometheus.
    • a data warehouse management system like the Elastic stack or VictoriaMetrics.
    • Demonstrated skills in visualization software such as Kibana (part of the Elastic stack) and Grafana, with the knowledge to help other groups create plots or analyses of their data.
  • Hands-on experience developing and maintaining diagnostic tools in programming languages such as C, C++, Python, Java, or Perl, following standard software development practices.
  • Networking: understanding of network theory and concepts such as TCP/IP, UDP, ICMP (networking protocols in general), MAC addresses, IP packets, DNS, OSI layers, and load balancing.

 

Desired Qualifications:

  • Experience with network security such as configuring/maintaining ACLs and knowledge of firewalls.
  • NOW platform certification.
  • Knowledge of AJAX, HTML, CSS, and SOAP.
  • Knowledge of AngularJS.
  • Network programming or a network certification.
  • A certification in a system administration area.

 

Notes:

  • This is a full-time career appointment, exempt (monthly paid) from overtime pay.
  • This position may be subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
  • This position will initially be remote and, due to COVID-19, is for now limited to individuals residing in the United States. Once the Bay Area shelter-in-place restrictions are lifted, work will be primarily performed at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA.

 

Equal Employment Opportunity: Berkeley Lab is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status. Berkeley Lab is in compliance with the Pay Transparency Nondiscrimination Provision under 41 CFR 60-1.4.  Click here to view the poster and supplement: "Equal Employment Opportunity is the Law."



 



The Lawrence Berkeley National Laboratory provides accommodation to otherwise qualified internal and external applicants who are disabled or become disabled and need assistance with the application process. Internal and external applicants that need such assistance may contact the Lawrence Berkeley National Laboratory to request accommodation by telephone at 510-486-7635, by email to eeoaa@lbl.gov or by U.S. mail at EEO/AA Office, One Cyclotron Road, MS90R-2121, Berkeley, CA 94720. These methods of contact have been put in place ONLY to be used by those internal and external applicants requesting accommodation.