Scientific Computing Site Reliability Engineer

Information Technology
IC-Information Technology
Requisition #: 92363

The Scientific Computing Group (SCG) in the Information Technology Division at Lawrence Berkeley National Laboratory (LBNL) is looking for a versatile Linux systems administrator / DevOps Engineer / Site Reliability Engineer to provide computing support to the Berkeley Lab research community. We manage the Lab’s High Performance Computing infrastructure and provide state-of-the-art Linux solutions in support of the science at Berkeley Lab. By providing computing tools, networks, and expertise, we help enable some of the most advanced fundamental research in the world.

 

Under the supervision of the Group Lead or senior team members, the successful candidate will participate in building, integrating and supporting Linux-based resources and end-users to meet the computing needs for various scientific disciplines. In addition, this position will revamp and automate complex sysadmin processes to make them more robust. This person may also support large high performance computing cluster systems depending on the individual's experience, aptitude and skill set. The successful candidate should exhibit a passion for learning, the ability to integrate new computing technologies, an ability to comprehensively re-engineer sysadmin processes, and a deep desire to support scientific research.

 

What You Will Do:

 

Within defined policies, procedures, and practices, provide Linux systems administration and user support for LBNL scientific research groups. This includes:

  • Linux system and HPC cluster maintenance and installations, operating system upgrades, system security hardening and intrusion detection, storage and file system management, system hardware and peripheral management, customization of user group working environment, troubleshooting, network monitoring, and crash recovery.
  • Design and implement build, deployment, and configuration management; build and test automation tools for infrastructure provisioning; handle code deployments; monitor metrics and develop ways to improve; build and manage CI and CD tools.
  • Assist users with program compilation, commercial and public domain software installation, and use of Linux tools. 
  • Configure, administer, and troubleshoot desktop, server and storage infrastructures as well as racking, installing, and maintaining systems in a datacenter.
  • Plan, organize, prioritize and complete assigned tasks and projects in a timely manner.
  • Frequently and clearly communicate task or project status to customers to either set or negotiate expectations.
  • Market IT Division services to the scientific community by providing excellent customer service coupled with competent technical support skills.
  • Participate in developing system administration, security, and network policies, documentation, and tools oriented towards efficient systems management.

 

In addition to the above, the Level 3 Engineer will:

  • Provide cluster support to LBNL and UC researchers. This includes travel to remote sites if necessary, initial installation, integration, and ongoing maintenance of Linux High Performance Computing cluster systems.
  • Lead technical efforts in one or more areas of HPC technologies such as job schedulers, high performance interconnects, parallel file systems, cybersecurity, cluster management, VM infrastructure, networking, performance tuning, support of scientific applications, or data center planning.
  • Lead group projects, of small to medium size and complexity, to implement and deploy new computing technologies and associated services to the research community.

 

What is Required:

  • Bachelor’s degree and a minimum of 5 years of related experience or an equivalent combination of education and work experience. 
  • Linux system administration experience in a large distributed computing environment. Experience providing systems and end-user support for multiple scientific or computational research groups.
  • Experience with Red Hat Enterprise Linux (including derivatives such as CentOS and Scientific Linux), Debian, Ubuntu and use of large-scale system administration tools and configuration management tools such as Kickstart, Ansible, Puppet, Chef, CFEngine, or in-house developed systems management tools. Support of common services such as NFS, LDAP, CIFS, MySQL, Apache/Nginx HTTPD.
  • Moderate knowledge of Linux internals, TCP/IP networking, software programming, and cybersecurity concepts. Must demonstrate technical understanding of Linux internals including the boot process, kernel versions, and the differences between major Linux distributions. Experience with building, patching, and modifying Linux RPMs is required. Able to quickly troubleshoot computer and storage hardware problems such as RAID devices, and be familiar with procedures to expedite or coordinate vendor service and bring resolution to outstanding problems.
  • Must be able to demonstrate programming proficiency in Python and Bash. Must understand how to build, optimize, and debug scientific codes written in C, C++, Fortran, and Java. Must have experience with popular compilers (e.g. GCC, Intel), program debugging tools, Makefiles, and version-control systems such as git and Subversion.
  • Experience with implementing solutions based on virtual machine (VM) technologies such as KVM, VMware, and OpenStack, as well as container technologies such as Docker and Singularity.
  • Excellent interpersonal, communication, and customer service skills, with demonstrated tact and good judgment. Must be able to work with multiple end-user groups where each group may have different needs and requirements. Able to plan, organize, prioritize, and complete assigned tasks and projects with general supervision while providing timely updates on work progress to end-users and co-workers.
  • Climb stairs, ladders, scaffolds; work at heights on above-rack cabling; work in confined spaces, under fluorescent lights; ability to bend, stoop, kneel, crawl; manual dexterity in both hands; able to lift 60 lbs. to chest height; distinguish colors.

 

In addition to the above, Required Qualifications for the Level 3 Engineer:

  • Typically requires a minimum of 8 years of related experience with a Bachelor’s degree; or 6 years and a Master’s degree; or equivalent experience in a large distributed computing environment, including 2 years of experience providing support for Linux HPC clusters used for scientific research.
  • In-depth expertise in two or more areas of HPC technologies such as Linux operating systems, job schedulers, high performance interconnects, parallel file systems, cybersecurity, cluster management, VM infrastructure, networking, performance tuning, support of scientific applications, or data center planning.
  • Ability to plan, organize and successfully implement group projects for deploying new technologies and services.
  • Ability to work on complex issues where analysis of situations or data requires an in-depth evaluation of variable factors.

 

Desired Qualifications:

  • Experience supporting HPC systems and end-users. HPC Linux clustering technology expertise (job schedulers, MPI, InfiniBand, parallel file systems, parallel programming).
  • Software engineering or development experience.
  • Previous experience supporting research at a National Lab or academic institution.

 

Notes:

  • This is a full-time career appointment, exempt (monthly paid) from overtime pay.
  • This position will be hired at a level commensurate with the business needs, skills, knowledge, and abilities of the successful candidate. 
  • This position may be subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
  • Work will be primarily performed at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA, though this may be modified to meet Covid-19 restrictions regarding onsite work. Some early morning, evening, and weekend work will be required to support critical systems, and this can be on very short notice.

 

 

Equal Employment Opportunity:

Based on University of California Policy - SARS-CoV-2 (COVID-19) Vaccination Program and U.S. Federal Government requirements, Berkeley Lab requires that all members of our community obtain the COVID-19 vaccine as soon as they are eligible. As a condition of employment at Berkeley Lab, all Covered Individuals must Participate in the COVID-19 Vaccination Program by providing proof of Full Vaccination or submitting a request for Exception or Deferral. Visit covid.lbl.gov for more information.

Berkeley Lab is committed to Inclusion, Diversity, Equity and Accountability (IDEA) and strives to continue building community with these shared values and commitments. Berkeley Lab is an Equal Opportunity and Affirmative Action Employer. We heartily welcome applications from women, minorities, veterans, and all who would contribute to the Lab's mission of leading scientific discovery, inclusion, and professionalism. In support of our diverse global community, all qualified applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status.

Equal Opportunity and IDEA Information Links:
Know your rights, click here for the supplement: Equal Employment Opportunity is the Law and the Pay Transparency Nondiscrimination Provision under 41 CFR 60-1.4.

 

 

Privacy and Security Notice | LBNL is an E-Verify Employer | Contact Us


The Lawrence Berkeley National Laboratory provides accommodation to otherwise qualified internal and external applicants who are disabled or become disabled and need assistance with the application process. Internal and external applicants that need such assistance may contact the Lawrence Berkeley National Laboratory to request accommodation by telephone at 510-486-7635, by email to eeoaa@lbl.gov or by U.S. mail at EEO/AA Office, One Cyclotron Road, MS90R-2121, Berkeley, CA 94720. These methods of contact have been put in place ONLY to be used by those internal and external applicants requesting accommodation.