📁 Information Technology
💼 NE-NERSC
Requisition #: 93662

Lawrence Berkeley National Lab’s (LBNL) NERSC Division has an opening for a Site Reliability Engineer to join the team.

In this exciting role, you will provide a variety of engineering support services 24x7 for the primary scientific computational facility for the Office of Science in the US Department of Energy (DOE).

You will work for the Operations Team (Ops) and ensure that NERSC resources are accessible, reliable, secure and available to our scientific users. This position also supports the Energy Sciences Network (ESnet) Network Operations Center during off-hours.

 

What You Will Do:

  • Using the skillset of a junior Linux system administrator (as identified in the publication at https://www.usenix.org/system/files/lisa/books/usenix_22_jobs3rd_core.pdf) and a working knowledge of the systems for which the Operations Technology Group is responsible, monitor and manage the reliability of the NERSC facility to enable continuous scientific progress for users in three areas: computation, data storage and the data center environment. This position works onsite for a minimum of three shifts each week, which may include nights, weekends and holidays.
  • Using the guidelines of the book Site Reliability Engineering (O’Reilly, ISBN 978-1-491-92912-4), practice the SRE philosophy in software development and system operations.
  • Solve problems to ensure system services are available, and create automation to prevent problem recurrence, with the goal of automating the response to all routine service conditions.
  • Under the guidelines of the group’s project manager, assist with developing and maintaining diagnostic tools used to support the HPC community within NERSC using programming languages like C, C++, Python, Java or Perl. Must have knowledge of standard software development practices.
  • Using knowledge of the Facility Operations processes, provide input into the design of software tools, workflows and new processes that continuously improve the diagnostic capabilities of the group to ensure the high availability of the HPC services provided by NERSC.
  • Assist in the testing and implementation of new diagnostic tools, workflows and new capabilities for providing high availability for the systems in production. Write the documentation necessary for these new tools and train staff in their use.
  • Provide accurate information in the trouble ticketing system for outages, maintenance and other incidents such that the workflow and protocols can be appropriately tracked by others.
  • Work closely with other NERSC groups to manage maintenance, perform tasks like upgrades, shut down batch queues, manage diagnostic and notification software, or generally manage a center-wide outage.
  • Conduct periodic on-call duties as necessary to support a 24x7 workflow.

Required Qualifications:

  • Bachelor’s degree in Computer Science or a similar discipline, or equivalent years of experience.
  • Minimum of 5 years of related experience, including 3 years as a system administrator or systems engineer in a high-volume, customer-facing environment supporting data clusters, managing the replacement of hardware, and ensuring its continuous availability to the user community. This can include assisting in the deployment of new nodes and internal switches into production, resolving ticket incidents and working with vendors on hardware warranty replacements. Hands-on experience in a Linux/UNIX environment, or knowledge of or significant exposure to Red Hat Enterprise Linux or another Linux variant.
  • Minimum of three years of experience with UNIX or Linux, networking and IT infrastructure, including management experience in a distributed-computing environment.
  • Strong hands-on knowledge of the Linux shell and working in a command-line (e.g. SSH) environment.
  • Experience developing tools in programming languages such as C, C++, Perl, Java and Python, or in a scripting language, with knowledge of standard software development practices.
  • Knowledge of standard operating procedures and of best practices for implementation and change management.
  • Networking: experience with network theory such as TCP/IP, UDP, ICMP (networking protocols in general), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
  • Strong understanding of monitoring implementations and administration.
  • Past experience with Incident Management and a good understanding of IT service management.
  • Exposure to Oracle or other high-end storage infrastructure.
  • Background configuring distributed, server-based or cluster-based infrastructure supporting a high volume of transactions in a Linux environment. An understanding of VMs and containers and how to manage them, and an understanding of IoT technologies.
  • Knowledge of and ability to work on large data communications networks and IT infrastructure supporting highly available systems and applications.
  • Demonstrated ability to deliver results on time with high quality.
  • Strong communication skills and ability to work effectively across multiple business and technical teams.
  • Excellent problem-solving skills with the ability to work on problems of diverse scope. Must be able to think independently, work collaboratively and contribute to an active intellectual environment. Must show good judgment and the ability to schedule and lead a small group of people and/or projects. Systematic problem-solving approach, coupled with a strong sense of ownership and drive.
  • Motivated self-starter who can learn emerging technologies that improve data center management in areas like Jupyter, Kibana, Functions as a Service, Kubernetes, building management software, evaporative cooling and power utilization.

 

Desired Qualifications:

  • Experience working in a 24/7 onsite team managing large data centers or other large installations. Working off shift is a lifestyle change that should be considered by the candidate.
  • Experience with network security: configuring and maintaining ACLs, and knowledge of firewalls.
  • Network programming or a network certification.
  • A certification in a system administration area.
  • Ability to provide input toward creating new standards and methods for managing large-scale distributed systems.

 

NOTES:

  • This is a full-time career appointment, exempt (monthly paid) from overtime pay.
  • This position may be subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
  • This position supports a 24x7 operation. The work schedule may include day, swing, night shift or weekend hours depending on operational needs and a 24x7 on-call rotation.
  • Must be able to complete an export control permit to create an account on a DOE system.
  • Work will be primarily performed at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA.

 

Berkeley Lab is committed to Inclusion, Diversity, Equity and Accountability (IDEA) and strives to continue building community with these shared values and commitments.

Berkeley Lab is an Equal Opportunity and Affirmative Action Employer. We heartily welcome applications from women, minorities, veterans, and all who would contribute to the Lab’s mission of leading scientific discovery, inclusion, and professionalism. In support of our diverse global community, all qualified applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status.

Know your rights: see the supplement "Equal Employment Opportunity is the Law" and the Pay Transparency Nondiscrimination Provision under 41 CFR 60-1.4.


 

LBNL is an E-Verify Employer.


The Lawrence Berkeley National Laboratory provides accommodation to otherwise qualified internal and external applicants who are disabled or become disabled and need assistance with the application process. Internal and external applicants that need such assistance may contact the Lawrence Berkeley National Laboratory to request accommodation by telephone at 510-486-7635, by email to eeoaa@lbl.gov or by U.S. mail at EEO/AA Office, One Cyclotron Road, MS90R-2121, Berkeley, CA 94720. These methods of contact have been put in place ONLY to be used by those internal and external applicants requesting accommodation.